The purpose of this study was to investigate the effects of various types of
multidimensional  tests  on  an  item  bias  detection  index  based  on 
unidimensional  item  response  theory  (IRT).  Data  for  two  types  of 
multidimensional  structures  were  simulated:  one  with  one  dominant  trait  and 
the  other  with  two  dominant  traits.  In  each  type  of  dimensionality,  both  no-bias 
and  bias  situations  were  considered.  Factors  that  were  experimentally 
manipulated  were  (a)  the  percentage  of  multidimensional  items  when  the  data 
were  dominantly  unidimensional,  (b)  the  between-group  mean  difference  on 
dominant  trait(s),  and  (c)  the  between-group  correlation  difference  when  there 
were  two  dominant  traits. 

The  42  data  sets,  each  set  having  two  subgroups  (reference  and  focal 
groups),  were  simulated  with  1000  simulees  and  40  items  in  each  subgroup. 
These simulated data sets were then calibrated by a unidimensional IRT
program, BILOG, and each item was examined for item bias using an item
parameter  invariance  index,  the  unsigned  sum  of  squared  differences  (USOS) 
index  with  baseline  method.  Finally,  the  occurrence  of  false  positives  under  the 
no-bias  situation  and  the  bias  situation  was  examined.  In  the  bias  situation,  the 
true-positive  detection  rate  was  also  examined. 

The  most  substantial  occurrence  of  false  positives  was  observed  when  the 
test  data  had  two  dominant  traits,  and  the  two  traits  had  different  correlations  for 
the  two  groups  under  investigation.  A  steady  increase  in  the  number  of  false 
positives  was  observed  as  the  correlation  difference  became  larger.  When  the 
correlation  between  the  two  traits  was  0  in  the  reference  group  and  .8  in  the 
focal  group,  29  out  of  40  unbiased  items  were  incorrectly  identified  as  biased  in 
the most extreme case. On the other hand, generally the occurrence of false
positives  was  not  substantial  when  there  were  between-group  mean  differences 
on  the  dominant  traits.  With  regard  to  true-positive  detection  rate,  when  the  test 
was  dominantly  unidimensional  and  when  more  than  10%  of  the  items  were 
multidimensional,  the  USOS  index  was  not  sufficiently  sensitive  to  detect  all  the 
truly  biased  items.  According  to  these  results,  a  cautious  use  of  the  IRT-based 
item  bias  index  (USOS)  with  the  baseline  method  is  suggested  when  the  test  is 
suspected  of  being  multidimensional,  because  a  correlation  difference  between 
two groups may seriously affect the accuracy of item bias detection.

CHAPTER 1
INTRODUCTION 


According  to  Crocker  and  Algina  (1986), 

a  set  of  items  is  unbiased  if  (1)  the  items  are  affected  by  the  same 
sources  of  variance  in  both  subpopulations;  and  (2)  among  examinees 
who are at the same level on the construct purportedly measured by the
test,  the  distributions  of  irrelevant  sources  of  variation  are  the  same  for 
both  subpopulations.  (p.  377) 

They  provided  an  example  of  an  unfair  advantage  to  one  subpopulation  of 
examinees  over  another  subpopulation  in  the  following  situation: 

An  item  on  a  well-known  individual  intelligence  test  asks,  "What  is  the 
thing  to  do  if  you  find  someone's  wallet  in  a  store?"  The  correct  answer  is 
to  report  the  discovery  to  someone  in  charge  of  the  store;  however,  it  is 
sometimes  argued  that  such  an  item  may  be  biased  against  children  from 
low-income  families  since  taking  the  money  home  to  a  parent  might 
seem  to  be  a  more  'sensible'  response  for  these  children  than  for 
children from more affluent homes. (p. 376)


According  to  the  preceding  definition,  item  bias  can  occur  when  scores  of 
different  subpopulations  may  be  affected  by  different  sources  of  variation.  It  is 
also  possible  for  item  bias  to  occur  when  scores  of  different  subpopulations  are 
affected  by  the  same  sources  of  variation  (traits  which  a  test  purportedly 
measures  as  well  as  extraneous  sources  of  variation),  but  the  extraneous 
sources  influence  performance  differently  among  subpopulations.  The  latter 
case is the focus of this study.

In recent years various statistical procedures have been introduced to detect
item  bias  resulting  from  racial,  ethnic,  gender,  and  other  demographic 
differences  among  examinees.  Among  these  techniques,  those  based  on  item 
response  theory  (IRT)  are  among  the  most  promising  because  these  techniques 
do  not  confound  item  bias  with  between-groups  differences  in  ability  (Drasgow, 
1987;  Ironson,  1983).  At  the  most  abstract  level,  IRT  provides  a  probabilistic 
way  of  linking  individuals'  observable  item  responses  to  theoretical  constructs 
contained in psychological theories (Hulin, Drasgow, & Parsons, 1983). The
model  specifies  the  relationship  between  the  probability  of  answering  an  item 
correctly  and  the  student's  ability.  This  relationship  is  called  an  item 
characteristic  curve  (ICC).  At  the  current  stage  of  development,  the  most  useful 
IRT  models  incorporate  the  unidimensionality  assumption:  The  response 
probabilities  are  the  function  of  a  single  latent  characteristic  of  the  examinees. 
Critics  argue  that  the  unidimensionality  assumption  is  somewhat  unrealistic, 
and  a  number  of  researchers  have  investigated  the  effects  of  applying  the 
unidimensional  IRT  model  to  multidimensional  data  (e.g.,  Drasgow  &  Parsons, 
1983;  Hattie,  1985;  Reckase  &  McKinley,  1983). 

Statement  of  the  Problem 

Unidimensionality  is  also  a  critical  assumption  of  IRT-based  techniques  for 
detecting  item  bias.  According  to  IRT  the  lack  of  invariance  of  ICCs  is  an 
indication  of  item  bias.  However,  when  examinee  responses  to  a 
multidimensional  item  set  are  analyzed  using  a  unidimensional  IRT  model,  the 
ICCs  may  vary  over  subpopulations  simply  because  the  model  is  inappropriate 
for  the  data  (Hunter,  1975).  Moreover,  this  can  occur  even  when  the  use  of  an 
appropriate  multidimensional  model  would  yield  ICCs  for  the  subpopulations 
that  are  similar.  Stated  differently,  "even  when  item  bias  does  not  exist, 
differences between the ICCs can occur if the items are not unidimensional"
(Crocker & Algina, 1986, p. 392). For example, Miller and Linn (1988) and
Muthen  (1987)  have  shown  that  achievement  tests  are  likely  to  be 
multidimensional  due  to  instructional  differences,  and  that  these  instructional 
differences  can  cause  differences  in  item  characteristic  curves. 

The presence of multidimensionality, however, does not necessarily imply
item  bias.  As  noted  at  the  beginning  of  this  chapter,  an  item  is  not  biased  if  the 
sources  of  variation  are  the  same  in  both  subpopulations  and  if  the  distributions 
of  irrelevant  sources  of  variation  are  the  same  for  the  two  groups.  This  definition 
suggests  the  possibility  of  an  item  that  is  multidimensional  but  not  biased.  The 
distinction  between  a  multidimensional  item  and  a  biased  item  is  important 
because  a  number  of  researchers  have  suggested  that  on  some  types  of  tests 
two  or  more  traits  are  required  to  answer  a  typical  question  (e.g.,  Reckase  & 
McKinley,  1983;  Traub,  1983).  If  all  such  items  were  eliminated  because  these 
items  appeared  to  be  biased,  the  content  and  construct  validity  of  the  test  scores 
would  be  seriously  altered.  Unfortunately  research  has  been  lacking  on  the 
effect  of  applying  item  bias  detection  techniques  based  on  unidimensional  IRT 
to  various  types  of  multidimensional  data.  In  the  present  study,  the  effect  of 
various  types  of  multidimensionality  on  item  bias  detection  using  IRT  was 
examined. 

Purpose of the Study
Two types of errors can be made in item bias studies: (a) truly unbiased items
can be declared to be biased (false positive errors), and (b) truly biased items
can be declared to be unbiased (false negative errors). In the context of
multidimensionality, false positive errors could occur when an item is
multidimensional but not biased. That is the case when there are irrelevant
sources of variation with similar distributions in both subgroups or when the
construct itself is multidimensional. False negative errors could occur when
IRT-based techniques fail to indicate true bias due to multidimensionality of the
construct.  The  purpose  of  this  study  was  to  examine  the  extent  to  which  these 
two  types  of  error  occurred  when  a  unidimensional  IRT  bias  index  was 
calculated  for  simulated  multidimensional  data  sets  in  which  item  parameters 
and  ability  distributions  of  the  two  groups  were  systematically  manipulated. 
Data  sets  were  generated  based  on  a  recently  developed  multidimensional  IRT 
model  by  Reckase  (1986).  This  model  provides  an  appropriate  framework  in 
which  hypothetical  multidimensional  abilities  and  items  can  be  easily  specified 
as  item  response  data  are  generated. 
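
A minimal sketch of how two-group multidimensional response data of this kind might be generated is given below. It assumes a compensatory two-dimensional logistic form in which the log-odds of a correct response are a weighted sum of two traits plus an intercept; this form, the function name `simulate_group`, and all parameter values are illustrative assumptions and are not taken from the study's actual generating procedure.

```python
import math
import random


def simulate_group(n_persons, items, rho, mean=(0.0, 0.0), seed=0):
    """Return an n_persons x n_items matrix of 0/1 responses.

    items: list of (a1, a2, d) tuples (hypothetical item parameters).
    rho:   correlation between the two latent traits in this group.
    mean:  trait means for this group (unit variances are assumed).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_persons):
        # Draw two standard normal traits with correlation rho.
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        t1 = mean[0] + z1
        t2 = mean[1] + rho * z1 + math.sqrt(1.0 - rho * rho) * z2
        row = []
        for a1, a2, d in items:
            # Assumed compensatory logistic model: logit = a1*t1 + a2*t2 + d.
            p = 1.0 / (1.0 + math.exp(-(a1 * t1 + a2 * t2 + d)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data


# Hypothetical design: 40 identical two-dimensional items, 1000 simulees per group,
# traits uncorrelated in the reference group and correlated .8 in the focal group.
items = [(1.0, 0.5, 0.0)] * 40
reference = simulate_group(1000, items, rho=0.0, seed=1)
focal = simulate_group(1000, items, rho=0.8, seed=2)
```

Because the item parameters are identical for both groups, any between-group difference in unidimensional item parameter estimates obtained from such data would reflect the trait structure rather than item bias.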

Significance  of  the  Study 
It  is  important  for  a  test  to  be  fair  to  every  examinee  in  a  heterogeneous 
population.  Stated  differently,  a  test  should  be  constructed  so  that  individuals 
who  possess  the  same  amount  of  an  underlying  trait,  but  who  come  from 
different  subpopulations,  have  the  same  probability  of  responding  to  an  item 
correctly  (in  ability  measurement)  or  positively  (in  attitude  measurement).  Two 
recently  settled  legal  cases  (Golden  Rule  Insurance  Company  et  al.  v. 
Washburn  et  al.,  1984;  Allen  v.  Alabama  State  Board  of  Education,  1985) 
illustrate  how  the  choice  of  a  statistical  procedure  for  measuring  item  bias  can 
have  important  practical  consequences.  For  example,  Drasgow  (1987) 
summarized  the  first  case  as  follows: 

The  Golden  Rule  Insurance  Company  filed  suit  against  the  Educational 
Testing  Service  (ETS)  and  Illinois  Department  of  Insurance  when  it  was 
found  that  blacks  had  higher  failure  rate  than  whites  on  various  Illinois 
insurance  licensing  exams  constructed  by  ETS.  .  .  .  {As  a  settlement,} 
the  ETS  agreed  to  select  items  for  which  the  proportions  of  correct 
answers  for  whites  and  blacks  differ  by  no  more  than  .15  when  such 
items are available. (p. 19)

Unfortunately,  it  was  an  undesirable  settlement  from  a  psychometric  point  of 
view because it resulted in elimination of items that discriminated effectively
between high and low ability examinees. Drasgow stated in his conclusion that


measurement  bias  should  not  be  investigated  using  the  proportion- 
correct  statistic  because  it  confounds  bias  with  between-group 
differences in the attribute measured by the test. . . . To reach correct
conclusions  about  bias  it  is  necessary  to  use  a  method  that  does  not 
confound  measurement  bias  with  between-group  differences.  One  such 
method  is  provided  by  IRT.  (p.  28) 


As  depicted  above,  the  development  of  psychometrically  justified  item  bias 
techniques,  such  as  those  based  on  IRT,  is  significant.  As  typically 
implemented,  IRT  is  based  on  a  rather  stringent  assumption  of 
unidimensionality.  In  many  situations,  however,  this  assumption  may  be 
untenable.  The  effect  of  the  violation  of  the  assumption  on  the  IRT-based  item 
bias  indices  needs  to  be  examined.  Results  of  this  study  should  provide 
information about how the two types of error rates are affected when the
unidimensionality  assumption  is  not  met. 


CHAPTER  2 
REVIEW  OF  THE  LITERATURE 


The  review  of  the  literature  in  this  chapter  includes  an  overview  of  item  bias 
techniques  with  a  focus  on  IRT-based  item  bias  techniques.  Also  presented  is 
the  literature  related  to  the  issue  of  the  unidimensionality  assumption  in  IRT. 
Finally,  the  literature  on  multidimensionality  and  item  bias  is  reviewed. 

Detecting  Item  Bias 
Overview  of  Item  Bias  Techniques 

In  recent  years,  a  number  of  techniques  have  been  devised  for  detecting  item 
bias.  The  two  major  approaches  are  (a)  methods  based  on  IRT  and  (b) 
techniques  that  are  simpler  to  apply  than  IRT  procedures,  such  as  chi-square 
techniques  and  the  Mantel-Haenszel  statistic.  These  techniques  are  similar  in 
that  each  is  based  on  the  notion  that  "an  item  is  considered  unbiased  if 
individuals  with  equal  ability,  but  from  different  groups,  have  the  same 
probability of answering the item correctly" (Ironson, 1982, pp. 117-118). The
techniques  allow  the  test  user  to  estimate  ability  from  information  provided  by 
the  test  as  a  whole.  Using  these  estimates,  different  groups  are  then  compared 
on  the  probability  that  subjects  with  equal  ability  estimates  will  answer  the  item 
correctly. 

With  Scheuneman's  (1979)  chi-square  technique  total  test  scores  serve  as 
the  ability  measures;  each  group's  observed  and  expected  scores  are 
compared  within  each  ability  level.  The  Mantel-Haenszel  (1959)  technique 
recently has aroused an increasing level of interest as a technique to detect
item bias (also known as unexpected differential item performance). Again,
total  scores  are  used  as  ability  measurements,  and  a  common  odds-ratio 
statistic  is  calculated  for  the  comparison  across  groups  (see  Holland,  1985). 
Finally,  in  item  response  theory  ICCs  are  compared  in  order  to  detect  item  bias. 
IRT  procedures  are  designed  to  be  insensitive  to  the  different  shapes  of  the 
distribution  of  ability  in  the  groups  of  interest  (Ironson,  1983). 
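
As an illustration of the total-score matching described above, the following sketch computes the Mantel-Haenszel common odds ratio in its usual form, the sum over score levels of (reference right x focal wrong / N) divided by the sum of (reference wrong x focal right / N). The function name and the counts are hypothetical and serve only to show the arithmetic.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across total-score levels.

    strata: iterable of (ref_right, ref_wrong, foc_right, foc_wrong) counts,
            one tuple per total-score level.
    """
    num = 0.0
    den = 0.0
    for ref_right, ref_wrong, foc_right, foc_wrong in strata:
        n = ref_right + ref_wrong + foc_right + foc_wrong
        if n == 0:
            continue  # an empty score level contributes nothing
        num += ref_right * foc_wrong / n
        den += ref_wrong * foc_right / n
    return num / den


# Hypothetical counts for one item at three total-score levels.
levels = [(20, 20, 10, 10), (30, 10, 15, 5), (36, 4, 18, 2)]
print(mantel_haenszel_odds_ratio(levels))
```

A value near 1 indicates that, among examinees matched on total score, the odds of a correct response are similar in the two groups.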

Both  Scheuneman's  chi-square  and  the  Mantel-Haenszel  statistic  are 
considered  to  be  rough  approximations  to  the  more  psychometrically  sound  IRT 
approach  (e.g.,  Shepard,  Camilli,  &  Averill,  1981).  In  both  methods,  ability  is 
estimated  by  the  total  observed  score  and  ability  categories  are  created  by 
dividing  the  total  score  distribution  into  levels.  In  both  methods,  matching  of 
examinees  on  observed  score  categories  and  the  arbitrariness  of  establishing 
ability  levels  are  critical  disadvantages.  On  the  other  hand,  the  IRT  approach 
employs  latent  trait  scores.  Using  latent  ability  estimates  allows  for  comparison 
of  two  groups  without  the  rather  subjective  matching  of  ability.  Therefore,  the 
IRT  approach  appears  to  be  the  most  promising.  However,  it  has  the 
disadvantages  of  high  cost,  large  sample  size  requirement,  and  the  rather 
stringent  assumption  of  unidimensionality.  In  the  next  section,  the  IRT  approach 
to  item  bias  detection  is  discussed  in  detail. 
IRT-based Item Bias Techniques

According  to  Lord  (1980),  "if  each  test  item  in  a  test  had  exactly  the  same 
item response function in each group, then people at any given level of ability or
skill would have exactly the same chance of getting the item right regardless of
their group membership" (p. 212). He called such a test completely unbiased.
The  principle  of  this  approach  to  detecting  biased  items  is  to  obtain  an  ICC  for 
each  subgroup  separately  and  to  compare  the  ICCs  to  see  if  they  "differ."  There 
are several indices used for comparison depending on the model of IRT.

The most general model is the three-parameter logistic model which was
formulated by A. Birnbaum and published in Lord and Novick (1968). The
equation  for  the  ICC  in  this  model  is 


$$P_i(\theta) = c_i + (1 - c_i)\,\frac{e^{D a_i(\theta - b_i)}}{1 + e^{D a_i(\theta - b_i)}} \qquad (1)$$

where $P_i(\theta)$ is the probability of a correct response to item $i$ by a person with
ability level $\theta$; $a_i$, $b_i$, and $c_i$ are discrimination, difficulty, and guessing
parameters respectively; and $D$ is a constant customarily set at 1.7.
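
The ICC in Equation 1 can be evaluated directly. The short sketch below is illustrative only; the function name `icc_3pl` and the item parameter values are assumptions chosen for the example.

```python
import math

D = 1.7  # scaling constant, as in Equation 1


def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model."""
    z = D * a * (theta - b)
    # e^z / (1 + e^z) is written in the equivalent form 1 / (1 + e^-z).
    return c + (1.0 - c) / (1.0 + math.exp(-z))


# Hypothetical item: discrimination 1.0, difficulty 0.0, guessing 0.2.
for theta in (-3.0, 0.0, 3.0):
    print(theta, icc_3pl(theta, a=1.0, b=0.0, c=0.2))
```

At an ability equal to the item difficulty the probability is halfway between the guessing parameter and 1, and for very low abilities it approaches the guessing parameter.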
Two  types  of  item  bias  indices  are  used  with  the  three-parameter  model:  (a) 
the  test  statistics  for  the  hypothesis  tests  of  the  equality  of  the  parameters  across 
subgroups  and  (b)  the  area  or  sum  of  squares  between  the  ICCs.  The  first 
approach was proposed by Lord (1980) to test the null hypotheses that $a_{i1} = a_{i2}$
and $b_{i1} = b_{i2}$, and is known as Lord's chi-square. Because of the difficulty in
accurate estimation of $c_i$ for separate subgroups, the combined sample is used
to estimate $c_i$. The $a_i$ and $b_i$ are subsequently reestimated for each sample
separately holding $c_i$ constant, standardizing on $b_i$ rather than $\theta$. The null
hypothesis, that jointly $a_{i1} = a_{i2}$ and $b_{i1} = b_{i2}$, is tested by a chi-square statistic.
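
The joint test can be sketched as a quadratic form in the vector of parameter differences, using the sum of the two groups' sampling covariance matrices. This is the form in which the statistic is commonly written; the function name, the estimates, and the covariance values below are hypothetical, and independent samples are assumed.

```python
import math


def lord_chi_square(est1, est2, cov1, cov2):
    """Chi-square statistic (2 df) for H0: a_i1 = a_i2 and b_i1 = b_i2.

    est1, est2: (a, b) estimates for the two groups.
    cov1, cov2: 2x2 sampling covariance matrices of those estimates.
    """
    da = est1[0] - est2[0]
    db = est1[1] - est2[1]
    # Covariance matrix of the difference vector (independent samples assumed).
    s11 = cov1[0][0] + cov2[0][0]
    s12 = cov1[0][1] + cov2[0][1]
    s22 = cov1[1][1] + cov2[1][1]
    det = s11 * s22 - s12 * s12
    # Quadratic form v' S^{-1} v written out for the 2x2 case.
    chi2 = (s22 * da * da - 2.0 * s12 * da * db + s11 * db * db) / det
    # Upper-tail probability of a chi-square variate with 2 df is exp(-x / 2).
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value


# Hypothetical estimates and covariance matrices for one item.
cov = [[0.01, 0.0], [0.0, 0.02]]
chi2, p = lord_chi_square((1.2, 0.4), (1.0, 0.0), cov, cov)
print(chi2, p)
```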

The  second  procedure  was  developed  by  Rudner  (1977).  In  this  procedure, 
the  area  between  the  ICCs  for  two  subgroups  is  calculated  as  a  measure  of  the 
difference  between  two  ICCs.  The  area  between  ICCs  of  item  i  is  approximated 
by  the  formula 

$$A_i = \sum_{\theta = -4.00}^{4.00} .005\,\left|P_{i1}(\theta) - P_{i2}(\theta)\right| \qquad (2)$$

where $P_{i1}(\theta)$ and $P_{i2}(\theta)$ refer to the values of the height of ICC for the two
groups respectively and are calculated for each value of $\theta$ from -4.00 through
4.00  in  steps  of  .005.  Linn,  Levine,  Hastings,  and  Wardrop  (1981)  suggested 
alternatives  to  Rudner's  area  measure.  The  formula  for  the  square  root  of  the 
sum  of  squared  (RSOS)  differences  between  the  item  response  curves  of  a 
given  item  for  the  two  subgroups  is 

$$\mathrm{RSOS} = \left\{(.01)\sum_{\theta = -3}^{3}\left[P_{i1}(\theta) - P_{i2}(\theta)\right]^2\right\}^{0.5}. \qquad (3)$$

In this formulation, the distance from $\theta = -3$ to $\theta = 3$ is divided into 600 intervals
of width .01. Linn et al. also discussed weighted versions of this procedure in
which  the  difference  between  the  curves  is  weighted  by  an  estimate  of  the 
standard  error  of  the  difference  between  the  curves.  However,  the  weighted 
version  has  not  appeared  to  be  more  useful  than  the  simpler,  unweighted 
indices (Hulin, Drasgow, & Parsons, 1983). Slightly different versions from the
indices  of  Rudner  and  Linn  et  al.  have  been  suggested  for  both 
weighted/unweighted  and  signed/unsigned  formulas  (Shepard,  Camilli,  & 
Williams,  1985).  In  signed  versions,  the  direction  of  the  bias  is  taken  into 
account.  Shepard  et  al.  (1985)  discussed  various  types  of  IRT  indices  including 
signed  and  unsigned  area,  signed  and  unsigned  sum  of  squares,  and  weighted 
signed  and  unsigned  sum  of  squares. 
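
The unweighted, unsigned area and RSOS indices of Equations 2 and 3 can be sketched as simple sums over a grid of ability values. The function names and item parameters below are illustrative assumptions; items are represented as (a, b, c) tuples for the three-parameter model.

```python
import math

D = 1.7


def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC (Equation 1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def area_index(item1, item2, lo=-4.0, hi=4.0, step=0.005):
    """Unsigned area between two ICCs, approximated as in Equation 2."""
    n = int(round((hi - lo) / step))
    total = 0.0
    for k in range(n + 1):
        theta = lo + k * step
        total += step * abs(icc_3pl(theta, *item1) - icc_3pl(theta, *item2))
    return total


def rsos_index(item1, item2, lo=-3.0, hi=3.0, step=0.01):
    """Square root of the sum of squared ICC differences, as in Equation 3."""
    n = int(round((hi - lo) / step))
    total = 0.0
    for k in range(n + 1):
        theta = lo + k * step
        diff = icc_3pl(theta, *item1) - icc_3pl(theta, *item2)
        total += diff * diff
    return math.sqrt(step * total)


# Hypothetical item whose difficulty differs by 0.5 between the two groups.
group1 = (1.0, 0.0, 0.0)
group2 = (1.0, 0.5, 0.0)
print(area_index(group1, group2), rsos_index(group1, group2))
```

For two curves with equal discrimination and no guessing, the area should come out close to the difference in difficulty, slightly reduced because the grid stops at -4 and 4. Signed versions would drop the absolute value (or retain the sign of the difference) so that the direction of the difference is preserved.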

Both  Lord's  chi-square  and  the  area  measures  have  advantages  and 
disadvantages.  The  problem  of  Lord's  chi-square  is  that  ICCs  can  have  quite 
different  parameters  and  still  be  substantially  the  same  (Linn  et  al.,  1981).  In  a 
recent study, for example, McLaughlin (1986) showed by simulation that many
more items were found to be "biased" as a result of Type I errors than would
have  been  expected  on  the  basis  of  the  nominal  alpha  level  of  the  test  when 
abilities  and  item  parameters  were  estimated  simultaneously.  Unlike  Lord's  chi- 
square,  the  index  of  Linn  et  al.  lacks  statistical  justification.  The  sampling 
distributions  of  the  bias  indices  are  unknown  for  the  condition  of  "no  difference 
between ICCs." Therefore, a test for statistical significance cannot be
performed. 

The  methods  developed  by  Lord  and  Linn  et  al.  can  also  be  applied  to  the 
two-parameter  model.  Hulin,  Drasgow,  and  Komocar  (1982)  developed  an 
indirect  test  of  item  bias  for  the  two-parameter  logistic  model  using  the  F  test.  By 
a logit transformation, ICCs are transformed into a linear function of $\theta$. Then
two  regression  lines  from  each  subgroup  are  compared  for  equality.  Although 
this  procedure  is  straightforward,  it  is  applicable  only  to  a  test  in  which  there  is 
no  guessing,  such  as  an  attitude  test. 
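
The linearity obtained from the logit transformation can be checked directly from Equation 1. As a brief derivation, assuming the two-parameter case is obtained by setting the guessing parameter to zero:

```latex
\begin{align*}
P_i(\theta) &= \frac{e^{D a_i(\theta - b_i)}}{1 + e^{D a_i(\theta - b_i)}}, \\
\ln\frac{P_i(\theta)}{1 - P_i(\theta)} &= D a_i(\theta - b_i) = (D a_i)\,\theta - D a_i b_i .
\end{align*}
```

The logit is thus a straight line in $\theta$ with slope $D a_i$ and intercept $-D a_i b_i$, so equal regression lines across subgroups correspond to equal $a_i$ and $b_i$. When $c_i > 0$ the logit of $P_i(\theta)$ is no longer linear in $\theta$, which is consistent with restricting the procedure to tests without guessing.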

Wright,  Mead,  and  Draba  (1976)  proposed  difficulty  shift  as  an  indicator  of 
bias  in  the  one-parameter  logistic  model  (Rasch  model).  Wright  and  Stone 
(1979)  provided  another  test  for  bias  in  the  Rasch  model  utilizing  the  fit  of  the 
model  to  item  response  data.  Because  of  the  problems  of  the  Rasch  model 
itself,  such  as  the  implausible  assumptions  of  equal  item  discriminations  and  no 
guessing on a test (see Divgi, 1986, for details), the item bias techniques based
on  this  model  appear  to  be  less  promising. 

All  the  procedures  mentioned  above  are  based  on  a  unidimensionality 
assumption,  which  is  one  of  the  most  critical  assumptions  in  IRT.  Unless  this 
assumption  is  met,  parameter  estimation  in  IRT  may  be  spurious,  and 
consequently  the  item  bias  indices  based  on  IRT  will  be  inaccurate.  Several 
recent  investigations  of  robustness  of  IRT  when  the  unidimensionality 
assumption is violated have been conducted; however, the impact of violation of
the unidimensionality assumption on the IRT-based item bias techniques has
been  relatively  unexplored.  In  the  next  section,  the  unidimensionality 
assumption  in  IRT  is  discussed. 

Unidimensionality  Assumption  in  IRT 
Multidimensionality in Tests

A  basic  assumption  in  the  use  of  most  of  the  latent  trait  models  is  that  the 
hypothetical  variable  measured  by  a  test  can  be  described  in  one-dimensional 
latent space (Lord & Novick, 1968). In other words, only one trait underlies an
individual's test performance. This assumption leads to the principle of local
independence in which items are statistically independent at a given level of $\theta$.
This assumption is rather stringent because "every test and every set of
responses by real individuals is multidimensional to some degree" (Harrison,
1986, p. 91). Drasgow and Lissak (1983) stated:

It  appears  likely  that  more  than  a  single  common  factor  could  be 
extracted from the item correlations of, say, a verbal ability test that uses
a variety of item types, an attitude instrument that attempts to measure a
"substantially rich" construct . . . , or a test of achievement in a reasonably
broad domain. (p. 364)

Traub  (1983)  pointed  out  the  problem  of  model  misfit  in  the  application  of  item 
response  theory,  and  concluded  that  "no  unidimensional  item  response  model 
is likely to fit educational achievement data" (p. 65). Traub described three
factors present in item responses to a typical educational achievement test that
would violate unidimensionality: instructional emphasis, speed of work, and
individual  propensities  to  guess.  As  McKinley  and  Kingston  (1988)  pointed  out, 
science  or  math  tests  are  likely  to  be  multidimensional.  In  a  science  test,  items 
are  sampled  from  several  content  areas  such  as  physics  and  chemistry.  A  math 
test may contain multi-faceted items such as word problems requiring a high
level of reading comprehension. When Ackerman (1988) applied a
multidimensional  IRT  model  to  the  ACT  (American  College  Testing)  assessment 
mathematics  usage  test,  he  found  two  dimensions:  a  computational  dimension 
and  a  verbal  dimension.  Not  only  are  many  tests  likely  to  be  multidimensional, 
but  several  different  types  of  multidimensionality  have  been  suggested.  Began 
and  Yen  (1983)  simulated  four  types  of  test  configurations  that  could  occur  in  a 
possible  two-trait  situation.  In  each  configuration,  item  parameters  were  set  in 
such  a  way  that  it  represented  a  practical  situation,  such  as  a  science  test 
where  an  examinee's  item  responses  could  be  the  result  of  the  two  traits: 
reading  ability  and  knowledge  of  science  facts,  or  a  social  studies  test  where  a 
few  items  require  both  reading  ability  and  ability  to  understand  graphs,  and 
other  items  require  reading  only.  In  spite  of  this,  unidimensional  IRT  has  been 
widely  used  as  if  most  tests  measure  only  one  trait. 

With  regard  to  this  critical  assumption  of  unidimensionality  in  IRT,  there 
appear  to  be  three  directions  in  current  research  in  this  field.  First,  several 
multidimensional  IRT  models  have  been  proposed.  Second,  procedures  to 
assess unidimensionality have been extensively researched. Third, the
robustness of estimation procedures has been studied when unidimensional
IRT is applied to multidimensional data.
Multidimensional  IRT 

Several  multidimensional  IRT  (MIRT)  models  have  been  proposed  (Bock  & 
Aitkin,  1981;  Reckase  &  McKinley,  1983;  Samejima,  1974;  Sympson,  1978; 
Thissen  &  Steinberg,  1984;  Whitely,  1980).   However,  few  researchers  have 
made  use  of  multidimensional  models  to  analyze  data  as  of  yet  because  of  their 
complexity  and  problems  of  parameter  estimation.  There  are  some  computer 
programs,  such  as  TESTFACT  (Wilson,  Wood,  &  Gibbons,  1984),  MAXLOG 
(McKinley & Reckase, 1983), and MIRTE (Carlson, 1987), available for
implementing multidimensional models. However, none has undergone
exhaustive testing (Batley & Boss, 1988).
Assessment of Unidimensionality

There  are  numerous  procedures  to  assess  unidimensionality.  Hattie  (1985) 
identified  over  30  indices  of  unidimensionality  in  his  survey  of  existing  methods. 
Hattie  stated  "yet  despite  its  importance,  there  is  not  an  accepted  and  effective 
index  of  the  unidimensionality  of  a  set  of  items"  (p.  139).  Stout  (1987)  too 
agreed  that  "there  is  certainly  no  widespread  agreement  among 
psychometricians  as  to  which  procedure  or  procedures  are  appropriate  to  use" 
(p.  3). 

Among  many  approaches,  a  commonly  used  method  is  the  linear  factor 
analytic approach (Drasgow, 1983; Hambleton & Murray, 1983). In this
approach,  a  matrix  of  inter-item  correlations  is  subjected  to  factor  analysis. 
However,  several  researchers  (e.g.,  Kingston  &  McKinley,  1988)  have 
documented  concerns  with  this  procedure,  noting  problems  inherent  to  the 
correlation  coefficients  (phi  and  tetrachoric  coefficients).  Holland  and 
Rosenbaum  (Holland,  1981;  Holland  &  Rosenbaum,  1985;  Rosenbaum,  1984) 
and Stout (1987) have developed a contingency-table approach, based on
"conditional  association"  as  a  necessary  condition  for  local  independence. 
Bejar  (1980)  proposed  an  approach  which  is  useful  when  the  test  is  divided  into 
content  categories.  Drasgow  and  Lissak  (1983)  developed  a  data-simulation 
method called modified parallel analysis (MPA). In MPA, factor analysis and
IRT analysis are both employed to detect violations of unidimensionality that
interfere  with  parameter  estimation. 
Robustness  of  IRT  with  Multidimensional  Data 

Does  a  test  have  to  be  strictly  unidimensional  for  the  IRT  parameter 
estimation to be valid? The answer seems to be "no" for at least some cases.
Hambleton and Murray (1983) pointed out that statistical tests in which the null
hypothesis  of  unidimensionality  is  tested  by  using  factor  analysis  are  probably 
too  strong  a  prerequisite  for  the  use  of  IRT  methods,  because,  when  the  sample 
size  is  large,  the  tests  are  powerful  enough  to  always  reject  the  null  hypothesis. 
The  robustness  of  IRT  parameter  estimation  to  the  violation  of  the 
unidimensionality  assumption  has  been  investigated  in  various  simulation 
studies.  Reckase  (1979)  used  both  real  data  and  simulated  data  to  investigate 
the  effect  of  multidimensionality  when  unifactor  latent  trait  models  were  applied. 
Reckase  modeled  two  multidimensional  data  structures:  one  with  a  dominant 
latent  trait  and  a  small  number  of  weak  latent  traits  and  one  with  two  statistically 
independent  latent  traits.  Parameters  from  the  dominant  latent  trait  were 
recovered  adequately  using  the  LOGIST  computer  program.  When  there  were 
two  independent  traits,  one  factor  still  emerged  as  dominant  in  ability 
estimation. Drasgow and Parsons (1983) discovered that a truly
unidimensional latent trait space is not necessary for accurate estimation of IRT
parameters. They simulated a hierarchical factor structure in which there was
one  second-order  general  factor  and  five  correlated  first-order  common  factors 
for  which  correlations  among  common  factors  ranged  from  .02  through  .90. 
Under  this  condition,  the  LOGIST  computer  program  recovered  the  general 
latent  trait  when  the  prepotency  of  the  general  latent  trait  was  only  moderate 
(correlations  between  common  factors  were  .48  or  above).  Harrison  (1986) 
substantiated Drasgow and Parsons's study. He systematically manipulated the
strength of a second-order general factor, the number of first-order common
factors, the distribution of items loading on those common factors, and the
number of items in simulated tests. The computer program LOGIST effectively
recovered both item parameters and trait parameters implied by the general
factor in most of the hierarchical structures where correlations between common
factors ranged from .46 through 1.00. The above studies revealed that
parameter  estimation  in  IRT  is  robust  to  the  violation  of  the  unidimensionality 
assumption as long as the common factors have moderate intercorrelations.
Harrison  stated  that  IRT  can  be  applied  to  a  multidimensional  test  in  which 
speed  of  work  and  guessing  are  additional  traits,  because  speed  of  work  and 
propensities  to  guess  are  likely  to  be  correlated.  Stated  differently,  as  long  as 
traits  create  oblique  common  factors,  the  deviation  from  unidimensionality  is 
allowable  in  the  IRT  procedure.  Their  findings  are  significant  because  an 
achievement  test  tends  to  have  oblique  common  factors  when  the  test  has 
multiple  factors.  Reckase,  Ackerman,  and  Carlson  (1988)  suggested  that  the 
concept  of  unidimensionality  required  by  IRT  is  not  the  same  as  the  commonly 
held  conception  of  unidimensionality.  They  demonstrated,  both  theoretically 
and  empirically,  that  sets  of  items  that  measure  the  same  composite  of  abilities 
defined  by  multidimensional  IRT  were  shown  to  meet  the  unidimensionality 
assumption.  The  above  studies  suggest  a  need  for  redefining  the  concept  of 
unidimensionality  which  is  a  critical  assumption  of  item  response  theory. 

The  effect  of  using  unidimensional  models  to  analyze  multidimensional  data 
for  the  purpose  of  equating  and  computer  adaptive  testing  (CAT)  also  has  been 
studied over the past several years (Ackerman, 1987; Began & Yen, 1983;
Dorans & Kingston, 1985; Yen, 1984). By and large, the resulting estimates
have not accurately reflected the original characteristics of the data and have
been difficult to interpret unless there was clearly one dominant dimension.

Multidimensionality and IRT-Based Item Bias Techniques

Hunter  (1975)  showed,  using  a  hypothetical  achievement  test,  that  when  a 
test  is  multidimensional,  a  difference  between  item  characteristic  curves  is  not 
necessarily  evidence  of  "bias."  His  explanation  was  intuitive,  and  no  IRT 
estimation was involved. Drasgow (1987) pointed out that "if these assumptions
[i.e., unidimensionality] are not reasonably satisfied, then violation of
assumption may cause truly unbiased items to appear biased in statistical tests"
(p. 20). Therefore, multidimensionality has been suspected to cause unbiased
items  to  appear  biased. 

A series of researchers have examined the effect of instructional differences
on unidimensional IRT (Miller & Linn, 1988; Muthen, Kao, & Burstein, 1988; Vojir
& Shepard, 1988). Muthen (1987) pointed out the possibility of misestimation of
bias due to items being instructionally sensitive. Miller and Linn (1988) formed
curriculum clusters based on teachers' ratings of their students' opportunities to
learn the items on a test and found that the differences in the item
response curves between curriculum clusters were much larger
than differences reported in prior studies based on comparisons of black and
white students. By and large, these studies suggest that achievement tests are
likely to be multidimensional and that instructional differences cause differences
in item characteristic curves.

Ackerman  (1988)  suggested  that  differential  item  functioning  (DIF),  created 
by  model  misspecification,  can  be  accurately  predicted  if  the  multidimensional 
IRT  item  parameters  and  the  supporting  trait  distribution  (STD)  are  known.  His 
study  followed  Reckase's  (1986)  multidimensional  IRT  (MIRT)  model  in  its 
geometric perspective. The multidimensional two-parameter logistic (M2PL)
model is

$$P(x_{ij} = 1 \mid \mathbf{a}_i, d_i, \boldsymbol{\theta}_j) = \frac{\exp(\mathbf{a}_i' \boldsymbol{\theta}_j + d_i)}{1 + \exp(\mathbf{a}_i' \boldsymbol{\theta}_j + d_i)} \tag{4}$$

where $x_{ij}$ is the score (0, 1) on item i by person j, $\mathbf{a}_i$ is the vector of item
discrimination parameters, $d_i$ is a scalar parameter related to the difficulty of the
item, and $\boldsymbol{\theta}_j$ is the vector of ability parameters for person j. This is a
compensatory  model  in  which  high  values  of  one  ability  dimension  compensate 
for  low  values  of  another  ability  dimension.  The  probability  of  correct  response 
can be described by an equal probability contour plot. For example, in a
two-dimensional space, equal probability contour lines are all straight lines, and
any combinations of $\theta$-values that fall on the same line yield the same
probability of correct response.
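
The compensatory property can be checked numerically. The following is a minimal sketch of Equation 4 in Python with NumPy; the function name and the item and ability values are hypothetical and chosen only for illustration. It shows that two ability vectors lying on the same contour line yield the same probability of a correct response.

```python
import numpy as np


def m2pl_probability(a, d, theta):
    """Probability of a correct response under the compensatory M2PL model (Equation 4)."""
    a = np.asarray(a, dtype=float)
    theta = np.asarray(theta, dtype=float)
    z = theta @ a + d  # linear predictor a_i' theta_j + d_i
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical item that measures two traits equally (illustrative values only).
a_item = [1.0, 1.0]
d_item = 0.0

# Two simulees on the same contour line theta_1 + theta_2 = 1:
# a high value on one trait compensates for a low value on the other.
p_high_low = m2pl_probability(a_item, d_item, [2.0, -1.0])
p_low_high = m2pl_probability(a_item, d_item, [-1.0, 2.0])

print(p_high_low, p_low_high)
```

Because the model depends on the abilities only through the weighted sum in the exponent, any point on the line $a_{i1}\theta_1 + a_{i2}\theta_2 = c$ gives the same probability, which is why the contours are straight lines.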

Reckase (1985) defined a multidimensional item difficulty (MID) parameter,
$D_i$, such that

$$D_i = \frac{-d_i}{\left[\sum_{k=1}^{m} a_{ik}^2\right]^{0.5}} \tag{5}$$

where m is the number of dimensions, and $a_{ik}$ is the element of $\mathbf{a}_i$ for the kth ability
dimension. This parameter represents the distance from the origin of the ability
space to the point at which the item response surface (IRS) has the steepest
slope. This point is where the item provides the most information about the
person being measured. In other words, it is the point of inflection of the IRS in
a particular direction. The point is obtained by first translating Equation 4 into
polar coordinates, then taking the second derivative of that equation with
respect to $\theta_j$, where $\theta_j$ is the distance from the origin to $\boldsymbol{\theta}_j$, and solving for its zero
point. The result is that the second derivative is zero when
$P(x_{ij} = 1 \mid \mathbf{a}_i, d_i, \boldsymbol{\theta}_j) = .5$. Thus, the slope in a particular direction is at its
maximum when the IRS crosses the .5 equal probability plane. The line joining
the point at which the IRS has the steepest slope to the origin is at an angle of
$\alpha_{ik}$ to the kth ability dimension, where

$$\cos \alpha_{ik} = \frac{a_{ik}}{\left[\sum_{k=1}^{m} a_{ik}^2\right]^{0.5}} \tag{6}$$

Reckase (1985) also determined that the slope in the MID direction is

$$\text{Slope} = \frac{1}{4} \sum_{k=1}^{m} a_{ik} \cos \alpha_{ik} = \frac{1}{4} \left[\sum_{k=1}^{m} a_{ik}^2\right]^{0.5} \tag{7}$$

Reckase (1986) proposed a multidimensional discrimination parameter,
$\mathrm{MDISC}_i$, for item i to be a function of the slope of the IRS in the direction
specified by the multidimensional difficulty as follows:

$$\mathrm{MDISC}_i = \left[\sum_{k=1}^{m} a_{ik}^2\right]^{0.5} = \frac{-d_i}{\mathrm{MID}_i} \tag{8}$$
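
The quantities in Equations 5, 6, and 8 can be computed directly from an item's discrimination vector and its d parameter. The sketch below is illustrative only: the function name and the item values are hypothetical, and NumPy is assumed. It also checks that the point at distance MID from the origin, in the direction given by the direction cosines, lies on the .5 equal probability line.

```python
import numpy as np


def item_vector_summary(a, d):
    """Return MDISC (Equation 8), MID (Equation 5), and direction angles in degrees (Equation 6)."""
    a = np.asarray(a, dtype=float)
    mdisc = np.sqrt(np.sum(a ** 2))
    mid = -d / mdisc
    angles_deg = np.degrees(np.arccos(a / mdisc))
    return mdisc, mid, angles_deg


# Hypothetical two-dimensional item (illustrative values only).
a_item = np.array([1.2, 1.6])
d_item = -1.0

mdisc, mid, angles_deg = item_vector_summary(a_item, d_item)

# Point of steepest slope: distance MID from the origin along the item direction.
theta_star = mid * a_item / mdisc
z_star = a_item @ theta_star + d_item      # exponent of Equation 4 at that point
p_star = 1.0 / (1.0 + np.exp(-z_star))     # equals .5 when z_star is 0

print(mdisc, mid, angles_deg, p_star)
```

For this hypothetical item MDISC is 2.0 and MID is 0.5, so the item vector ends half a unit from the origin, where the probability of a correct response is .5.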

In Figure 1, three item vectors are illustrated geometrically in a two-
dimensional space (Ackerman, 1988). Each vector, expressed by an arrow,
represents an item. Also illustrated are the lines of equiprobability. In Figure 1,
only the p = .5 probability line, which runs orthogonally to the tip of the item
vector, is drawn for each vector. The circles indicate the STDs for Groups A and
B. Both groups have the same mean ability for dimension 2, that is,
$\bar{\theta}_{2A} = \bar{\theta}_{2B} = 1.2$, where $\bar{\theta}_{ij}$ is the mean on theta i for Group j, but Group B has a
higher mean ability for dimension 1, that is, $\bar{\theta}_{1A} = 0$ and $\bar{\theta}_{1B} = 2.3$. The success
of the two groups on each item can be roughly determined by examining the
p = .5 equiprobability line in relationship to each group's ability centroid. In
Figure 1, vector 1 is measuring only theta 2, vector 3 only theta 1, and vector 2
is measuring both theta 1 and theta 2. The two groups will perform equally well
on item 1 (vector 1), but Group B should easily outperform Group A on items 2
and 3.


Figure 1. Illustration of three item vectors in relation to the STDs from two
groups and p = .5 equiprobability lines (Ackerman, 1988). [Figure not
reproduced; horizontal axis labeled "THETA1 - READING."]


The  model  by  Reckase  (1986)  provides  an  appropriate  framework  in  which 
hypothetical  multidimensional  abilities  and  item  parameters  can  be  easily 
specified  as  item  response  data  are  generated.  In  Ackerman's  (1988)  study, 
the  ACT  Assessment  Mathematics  Usage  Test  was  analyzed  by  MIRT  and  item 
parameters  were  calibrated.  A  data  set  was  simulated  using  the  M2PL  model 
with  the  item  parameters  from  the  math  test  and  with  three  different  ability 
distributions  (one  for  a  reference  or  base  group,  two  for  focal  or  primary  groups 
of  interest)  generated  by  the  IMSL  subroutine,  GGNSM.  The  performance  of 
each  focal  group  was  predicted  to  be  better,  equal  to,  or  worse  than  the 
performance of the reference group. The predictions involved identifying the
direction and length of each item's two-dimensional vector in relationship to
each group's STD. Subsequently, the data were analyzed according to a
unidimensional IRT model using LOGIST, and item and ability parameters were
calibrated. Finally, the Linn-Harnisch (1981) Z statistic was calculated and the
results were compared with the previously made predictions. The "hit rates" for
the predictions ranged from 73% through 95%. Thus Ackerman (1988)
showed how DIF can occur when there is a misspecification of the latent ability
space.

As  mentioned  earlier,  the  definition  of  lack  of  bias  by  Crocker  and  Algina 
(1986)  is  that  (a)  sources  of  variance  are  the  same  across  subgroups,  and  (b) 
distributions  of  irrelevant  sources  of  variation  are  the  same  among  examinees 
who  are  at  the  same  level  on  the  construct  purportedly  measured  by  a  test. 
Reckase's  (1986)  and  Ackerman's  (1988)  approach  for  delineating  the 
relationship  between  item  types  and  STD  within  the  framework  of 
multidimensional  IRT  provides  a  geometric  explanation  of  item  bias  defined  by 
Crocker  and  Algina  (1986).  Figure  2  illustrates  an  item  vector  in  a  two- 
dimensional  space.  Theta  1  is  the  trait  which  the  test  purportedly  measures. 
Theta  2  is  an  irrelevant  trait  on  the  test.  Item  X  here  measures  both  theta  1  and 
theta  2  equally.  Group  A  and  Group  B  have  different  mean  thetas  for  the  first 
dimension,  but  the  mean  thetas  for  the  second  dimension  are  the  same. 
Therefore, for Group A and Group B, Item X is not biased. In contrast,
Group A and Group C have different mean thetas on the second (irrelevant)
dimension as well as the first (relevant) dimension. Therefore, here Item X is
biased for Group C in relation to Group A.

Figure 2. Illustration of an item vector to explain bias

According to the Crocker-Algina definition of item bias, item bias occurs
when  item  responses  are  influenced  by  a  dimension  other  than  the  dimension 
which  a  test  purportedly  measures  AND  the  distributions  of  variation  for  two 
groups  differ  on  that  irrelevant  dimension.  Figure  2  demonstrates  that  there  can 
be  cases  where  an  item  is  multidimensional,  but  not  biased.  Thus  it  is 
necessary  to  investigate  the  case  when  an  item  is  multidimensional  but  not 
biased,  to  examine  if  that  case  causes  lack  of  item  parameter  invariance. 

Another  concern  about  multidimensionality  and  bias  is  that  a  test  can 
measure  multidimensional  traits  and  all  the  traits  can  be  relevant  to  the  purpose 
of  measurement.  As  pointed  out  in  this  review,  in  various  testing  situations,  the 
construct  itself  is  multidimensional.  The  effect  of  this  type  of  multidimensionality 
on  the  IRT-based  item  bias  techniques  should  be  investigated  in  situations 
where  there  are  biased  items  and  where  no  bias  exists. 

Summary

First,  literature  pertinent  to  item  bias  techniques  has  been  reviewed.  Three 
approaches  reviewed  were  Scheuneman's  chi-square,  the  Mantel-Haenszel 
statistic, and item response theory. Even though the IRT-based approach
appears to be the most psychometrically sound and the most promising of these
techniques, it also suffers from some disadvantages, particularly its rather
stringent assumption of unidimensionality.

Second,  the  critical  assumption  of  unidimensionality  was  discussed.  The 
studies  cited  have  suggested  that  most  tests  are  multidimensional.  There  have 
been  three  types  of  research  with  regard  to  the  unidimensionality  assumption  in 
IRT:  (a)  development  of  multidimensional  IRT,  (b)  assessment  of 
unidimensionality,  and  (c)  checks  for  the  robustness  of  IRT.  Knowledge  in  the 
last category has been recently advanced by the findings that a truly
unidimensional  latent  trait  space  may  not  be  necessary  for  accurate  estimation 
of  IRT  parameters. 

Finally,  the  issue  of  multidimensionality  with  respect  to  item  bias  was 
delineated.  Unidimensionality  is  a  critical  assumption  in  IRT-based  item  bias 
techniques.  However,  the  impact  of  multidimensionality  on  the  IRT-based  item 
bias  techniques  is  not  well  known.  Past  research  has  suggested  that  item  bias 
causes  multidimensionality  and,  in  turn,  multidimensionality  causes  the  lack  of 
item  parameter  invariance  which  is  the  basis  of  the  IRT-based  item  bias 
detection.  The  multidimensional  IRT  model  by  Reckase  (1986)  provides  an 
appropriate  framework  in  which  hypothetical  multidimensional  abilities  and 
items  can  be  easily  specified.  With  that  model,  a  situation  where  an  item  is 
multidimensional,  but  not  biased,  is  easily  depicted.  Results  of  the  present 
study  should  reveal  the  measurement  artifacts,  if  any  exist,  of  IRT-based  item 
bias  techniques  when  multidimensionality  is  present  in  a  test. 


CHAPTER  3 
METHODOLOGY 


The  main  steps  of  the  present  study  were  to  (a)  generate  test  data  in  various 
multidimensional structures for both no-bias and bias situations, (b) perform
item bias detection based on unidimensional IRT, and (c) examine the accuracy
of the item bias detection. Factors experimentally manipulated in the
multidimensional  data  sets  were  (a)  percentage  of  items  influenced  by  the 
irrelevant  trait,  (b)  between-group  mean  differences  in  relevant  trait(s),  and  (c) 
the  correlation  between  relevant  traits.  A  detailed  description  of  the  design, 
data  generation,  and  analysis  is  presented  in  this  chapter. 

Design 

Data  for  two  different  types  of  multidimensional  test  structures  were  simulated 
in  the  study.  The  two  types  were  defined  in  terms  of  the  number  of  traits  that 
were  considered  relevant  to  the  purposes  of  measurement.  In  the  first  type  of 
multidimensional  structure  (MD  1)  one  trait  was  relevant;  in  the  second  type  of 
multidimensional  structure  (MD  2)  two  traits  were  relevant.  Both  structures  also 
included  one  irrelevant  trait  that  influenced  only  a  small  proportion  of  the  items. 
In  each  type  of  multidimensional  structure,  both  no-bias  and  bias  situations 
were  considered. 

In MD 1 the test was dominantly unidimensional. Trait 1 was the trait the test
was purportedly measuring. Trait 2 was the irrelevant trait, which the test was
not supposed to measure and which influenced only a small number of
items. The item parameters were the same for the reference and focal groups in
each condition investigated. In this study, the "focal group" refers to the group of
interest, and the "reference group" is the base group to which the performance
of the focal group is compared. Data for both no-bias and bias situations were
simulated. For the no-bias situation both the focal and reference groups had
the same distribution of scores on Trait 2. For the bias situation the focal and
reference groups had different means on Trait 2. The mean difference was
$\bar{\theta}_{2A} - \bar{\theta}_{2B} = .5$, where $\bar{\theta}_{2A}$ and $\bar{\theta}_{2B}$ are the means on Trait 2 for Groups A (focal
group) and B (reference group), respectively. For both no-bias and bias
situations, two factors were considered. The first factor, the between-group
mean difference on Trait 1, had two levels: (a) $\bar{\theta}_{1A} - \bar{\theta}_{1B} = 0$ and (b)
$\bar{\theta}_{1A} - \bar{\theta}_{1B} = .5$. The second factor, the percentage of items being multidimensional
with respect to the second trait, had three levels: (a) 5%, (b) 10%, and (c) 20%.
In other words, 5%, 10%, or 20% of all the items (the last 2, 4, and 8 items of the
40-item test, respectively) measured two underlying dimensions or traits. The
correlation between Trait 1 and Trait 2 was assumed to be zero. The
percentages (5%, 10%, or 20%) were chosen to be reasonable for published
tests when item bias studies are conducted. For example, Drasgow (1987)
found, in his investigation of item bias in ACT assessment tests in various
subjects, that from 5% to 29% of the items were biased based on Lord's
chi-square technique. The lower panel of Table 1 shows three conditions in the
no-bias situation under which ICCs of reference groups and focal groups were
compared. The lower panels of Table 2 show six conditions in the bias situation
under which ICCs of reference groups and focal groups were compared.

In  MD  2  there  were  three  traits:  two  relevant  traits  (Trait  1  and  Trait  2)  and 
one  irrelevant  trait  (Trait  3)  which  influenced  only  10%  of  the  items.  In  other 
words,  the  test  was  intended  to  measure  a  multidimensional  construct.  Data  for 
both  no-bias  and  bias  situations  were  simulated.  For  the  no-bias  situation  both 

Table 1

Three Baseline and Three Experimental Conditions in the No-Bias Situation in
MD 1

Condition   % Multidimensional Items   Mean Difference
B1          5%                         $\bar{\theta}_{1A} - \bar{\theta}_{1B} = 0$ and $\bar{\theta}_{2A} - \bar{\theta}_{2B} = 0$
B2          10%                        same as above
B3          20%                        same as above
1           5%                         $\bar{\theta}_{1A} - \bar{\theta}_{1B} = .5$ and $\bar{\theta}_{2A} - \bar{\theta}_{2B} = 0$
2           10%                        same as above
3           20%                        same as above

Note. B1, B2, and B3 are baseline conditions; see section entitled Baselines.


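Table 2

Three Baseline and Six Experimental Conditions in the Bias Situation in MD 1

Condition   % Multidimensional Items   Mean Difference
B1          5%                         $\bar{\theta}_{1A} - \bar{\theta}_{1B} = 0$ and $\bar{\theta}_{2A} - \bar{\theta}_{2B} = 0$
B2          10%                        same as above
B3          20%                        same as above
4           5%                         $\bar{\theta}_{1A} - \bar{\theta}_{1B} = 0$ and $\bar{\theta}_{2A} - \bar{\theta}_{2B} = .5$
5           10%                        same as above
6           20%                        same as above
7           5%                         $\bar{\theta}_{1A} - \bar{\theta}_{1B} = .5$ and $\bar{\theta}_{2A} - \bar{\theta}_{2B} = .5$
8           10%                        same as above
9           20%                        same as above

Note. B1, B2, and B3 are baseline conditions; see section entitled Baselines.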


the focal and reference groups had the same distribution of scores on Trait 3.
For the bias situation the focal and reference groups had different means on
Trait 3 ($\bar{\theta}_{3A} - \bar{\theta}_{3B} = .5$). For both no-bias and bias situations, three factors were
investigated. The first factor, number of traits on which there were
between-group mean differences, had three levels: (a) no mean differences on Trait 1 or
Trait 2 ($\bar{\theta}_{1A} - \bar{\theta}_{1B} = 0$, $\bar{\theta}_{2A} - \bar{\theta}_{2B} = 0$); (b) a mean difference on Trait 1 but not
Trait 2 ($\bar{\theta}_{1A} - \bar{\theta}_{1B} = .5$, $\bar{\theta}_{2A} - \bar{\theta}_{2B} = 0$); and (c) mean differences on both traits
($\bar{\theta}_{1A} - \bar{\theta}_{1B} = .5$, $\bar{\theta}_{2A} - \bar{\theta}_{2B} = .5$). The second factor, between-group differences in
the correlation between Trait 1 and Trait 2, had two levels. In the first level the
correlation was the same in both groups; in the second the correlations varied
across groups. Nested within the first level was the degree of correlation: 0, .5,
and .8. Nested within the second level was the magnitude of the difference: .5
vs. .8, 0 vs. .5, and 0 vs. .8. Table 3 contains a list of 15 conditions in the
no-bias situation under which ICCs of reference groups and focal groups were
compared. Table 4 contains a list of 18 conditions in the bias situation under
which ICCs of reference groups and focal groups were compared.
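
The condition counts for MD 2 follow from crossing the three mean-difference levels with the six correlation patterns. The sketch below enumerates that crossing; the variable and function names are hypothetical, and differences are written as focal minus reference, following the convention used for MD 1 above. In the no-bias situation the three cells with no group difference of any kind serve as baselines, which leaves 15 experimental conditions; in the bias situation all 18 cells are experimental.

```python
from itertools import product

# (Trait 1, Trait 2) between-group mean differences.
MEAN_DIFFS = [(0.0, 0.0), (0.5, 0.5), (0.5, 0.0)]
# (reference, focal) correlations between Trait 1 and Trait 2.
EQUAL_CORRS = [(0.0, 0.0), (0.5, 0.5), (0.8, 0.8)]
UNEQUAL_CORRS = [(0.5, 0.8), (0.0, 0.5), (0.0, 0.8)]


def md2_cells(trait3_diff):
    """All 18 crossings of correlation pattern and mean-difference level."""
    return [
        {"mean_diff": m, "corr": c, "trait3_diff": trait3_diff}
        for c, m in product(EQUAL_CORRS + UNEQUAL_CORRS, MEAN_DIFFS)
    ]


def is_baseline(cell):
    """A baseline cell has equal correlations and no mean difference on any trait."""
    return (
        cell["trait3_diff"] == 0.0
        and cell["mean_diff"] == (0.0, 0.0)
        and cell["corr"][0] == cell["corr"][1]
    )


no_bias_cells = md2_cells(trait3_diff=0.0)
baselines = [c for c in no_bias_cells if is_baseline(c)]
no_bias_experimental = [c for c in no_bias_cells if not is_baseline(c)]
bias_experimental = md2_cells(trait3_diff=0.5)

print(len(baselines), len(no_bias_experimental), len(bias_experimental))  # 3 15 18
```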

Research Questions

A set of research questions with regard to the multidimensional structure MD 1,
in which tests are dominantly unidimensional but a small percentage of items
are multidimensional, is as follows:

1. For the no-bias situation when there is a between-group mean difference
on $\theta_1$, do the false positives occur on either the multidimensional items or other
unidimensional  items?  If  so,  does  the  number  of  false  positives  increase  as  the 
percentage  of  multidimensional  items  increases? 

2.  For  the  bias  situation,  does  the  true-positive  detection  rate  decrease  as  the 
number  of  biased  items  increases? 
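
Table 3

Three Baseline and Fifteen Experimental Conditions in the No-Bias Situation in
MD 2

            $r_{\theta_1,\theta_2}$
Condition   Reference   Focal   Mean Difference
B1          .0          .0      $\bar{\theta}_{1A} - \bar{\theta}_{1B} = .0$, $\bar{\theta}_{2A} - \bar{\theta}_{2B} = .0$, and $\bar{\theta}_{3A} - \bar{\theta}_{3B} = .0$
B2          .5          .5      same as above
B3          .8          .8      same as above
1           .0          .0      $\bar{\theta}_{1A} - \bar{\theta}_{1B} = .5$, $\bar{\theta}_{2A} - \bar{\theta}_{2B} = .5$, and $\bar{\theta}_{3A} - \bar{\theta}_{3B} = .0$
2           .5          .5      same as above
3           .8          .8      same as above
4           .0          .0      $\bar{\theta}_{1A} - \bar{\theta}_{1B} = .5$, $\bar{\theta}_{2A} - \bar{\theta}_{2B} = .0$, and $\bar{\theta}_{3A} - \bar{\theta}_{3B} = .0$
5           .5          .5      same as above
6           .8          .8      same as above
7           .5          .8      $\bar{\theta}_{1A} - \bar{\theta}_{1B} = .0$, $\bar{\theta}_{2A} - \bar{\theta}_{2B} = .0$, and $\bar{\theta}_{3A} - \bar{\theta}_{3B} = .0$
8           .0          .5      same as above
9           .0          .8      same as above
10          .5          .8      $\bar{\theta}_{1A} - \bar{\theta}_{1B} = .5$, $\bar{\theta}_{2A} - \bar{\theta}_{2B} = .5$, and $\bar{\theta}_{3A} - \bar{\theta}_{3B} = .0$
11          .0          .5      same as above
12          .0          .8      same as above
13          .5          .8      $\bar{\theta}_{1A} - \bar{\theta}_{1B} = .5$, $\bar{\theta}_{2A} - \bar{\theta}_{2B} = .0$, and $\bar{\theta}_{3A} - \bar{\theta}_{3B} = .0$
14          .0          .5      same as above
15          .0          .8      same as above

Note. B1, B2, and B3 are baseline conditions; see section entitled Baselines.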
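
Table 4

Three Baseline and Eighteen Experimental Conditions in the Bias Situation in
MD 2

            $r_{\theta_1,\theta_2}$
Condition   Reference   Focal   Mean Difference
B1          .0          .0      $\bar{\theta}_{1A} - \bar{\theta}_{1B} = .0$, $\bar{\theta}_{2A} - \bar{\theta}_{2B} = .0$, and $\bar{\theta}_{3A} - \bar{\theta}_{3B} = .0$
B2          .5          .5      same as above
B3          .8          .8      same as above
16          .0          .0      $\bar{\theta}_{1A} - \bar{\theta}_{1B} = .0$, $\bar{\theta}_{2A} - \bar{\theta}_{2B} = .0$, and $\bar{\theta}_{3A} - \bar{\theta}_{3B} = .5$
17          .5          .5      same as above
18          .8          .8      same as above
19          .0          .0      $\bar{\theta}_{1A} - \bar{\theta}_{1B} = .5$, $\bar{\theta}_{2A} - \bar{\theta}_{2B} = .5$, and $\bar{\theta}_{3A} - \bar{\theta}_{3B} = .5$
20          .5          .5      same as above
21          .8          .8      same as above
22          .0          .0      $\bar{\theta}_{1A} - \bar{\theta}_{1B} = .5$, $\bar{\theta}_{2A} - \bar{\theta}_{2B} = .0$, and $\bar{\theta}_{3A} - \bar{\theta}_{3B} = .5$
23          .5          .5      same as above
24          .8          .8      same as above
25          .5          .8      $\bar{\theta}_{1A} - \bar{\theta}_{1B} = .0$, $\bar{\theta}_{2A} - \bar{\theta}_{2B} = .0$, and $\bar{\theta}_{3A} - \bar{\theta}_{3B} = .5$
26          .0          .5      same as above
27          .0          .8      same as above
28          .5          .8      $\bar{\theta}_{1A} - \bar{\theta}_{1B} = .5$, $\bar{\theta}_{2A} - \bar{\theta}_{2B} = .5$, and $\bar{\theta}_{3A} - \bar{\theta}_{3B} = .5$
29          .0          .5      same as above
30          .0          .8      same as above
31          .5          .8      $\bar{\theta}_{1A} - \bar{\theta}_{1B} = .5$, $\bar{\theta}_{2A} - \bar{\theta}_{2B} = .0$, and $\bar{\theta}_{3A} - \bar{\theta}_{3B} = .5$
32          .0          .5      same as above
33          .0          .8      same as above

Note. B1, B2, and B3 are baseline conditions; see section entitled Baselines.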

3. For the bias situation, is the true-positive detection rate less when there is
a between-group mean difference on $\theta_1$ than when there is no between-group
mean difference on $\theta_1$?

A set of research questions with regard to the MD 2 structure when two
relevant traits underlie the measurement is as follows:

1. For the no-bias situation, do false positives occur due to between-group
mean differences on the two dominant traits, $\theta_1$ and $\theta_2$? If so, does the number
of false positives increase as the correlation between $\theta_1$ and $\theta_2$ weakens?

2. For the no-bias situation, do false positives occur due to a between-group
mean difference on the one dominant trait, $\theta_1$? If so, does the number of false
positives increase as the correlation between $\theta_1$ and $\theta_2$ weakens? Does the
number of false positives increase or decrease compared with the case when
there are between-group mean differences on the two traits, $\theta_1$ and $\theta_2$?

3. For the no-bias situation, do false positives occur because the correlation
between $\theta_1$ and $\theta_2$ in the reference group differs from that in the focal group? If
so, does the number of false positives increase as the difference between
the correlations increases?

4. For the no-bias situation, are more items identified as biased when
between-group mean (both $\theta_1$ and $\theta_2$, or only $\theta_1$) differences are combined
with a between-group correlation difference than when there are
between-group mean differences alone or when there is a between-group correlation
difference alone?

5. For the bias situation when there are no between-group mean or
correlation differences, does the true-positive detection rate increase as the
correlation between $\theta_1$ and $\theta_2$ increases? Does the number of false positives
increase as the correlation between $\theta_1$ and $\theta_2$ weakens?

6. For the bias situation when there are between-group mean differences
(either on both $\theta_1$ and $\theta_2$, or on only $\theta_1$), does the true-positive detection rate
increase as the correlation between $\theta_1$ and $\theta_2$ increases? Does the number of
false positives increase as the correlation between $\theta_1$ and $\theta_2$ weakens?

7. For the bias situation when there is a between-group correlation
difference, does the true-positive detection rate decrease as the difference
between the correlations increases? Does the number of false positives
increase as the difference between the correlations increases?

8. For the bias situation when between-group mean differences (either on
both $\theta_1$ and $\theta_2$, or on only $\theta_1$) are combined with a between-group correlation
difference, is the true-positive detection rate less than when there are
between-group mean differences alone or when there is a between-group correlation
difference alone? Does the number of false positives increase?

Data  Generation 
Simulees  and  Ability  Distributions 

Each  simulated  data  set  included  two  groups  of  simulees:  1000  simulees 
from  a  reference  group  and  1000  simulees  from  a  focal  group.  The  ability 
(theta) values were generated by using the function RANNOR in the Statistical
Analysis System (SAS, 1985). The sample sizes were chosen to assure the
stability  of  IRT  parameter  estimation  in  both  groups  and  also  to  reflect  a  practical 
situation  in  which  a  bias  study  is  conducted  by  sampling  1000  people  from  each 
subpopulation. In MD 2, correlated thetas ($\theta_1$ and $\theta_2$) were simulated by first
generating two independent normally distributed pseudorandom variables $z_1$
and $z_2$ and then transforming them to $\theta_1$ and $\theta_2$ by weighted linear
transformations. The weights were the elements of $T'$, a matrix which satisfies
$R = T'T$, where $R$ is the target correlation matrix.
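
The transformation of independent normal deviates into correlated thetas can be sketched as follows. This is an illustration in Python with NumPy rather than the SAS RANNOR routine used in the study; the function name, the seed, and the way the group mean is added as a constant shift are assumptions made for the example. The lower-triangular Cholesky factor L returned by `numpy.linalg.cholesky` satisfies R = LL' and supplies the weights.

```python
import numpy as np


def simulate_correlated_thetas(n, rho, mean=(0.0, 0.0), seed=0):
    """Draw n pairs (theta_1, theta_2) with unit variances and correlation rho."""
    rng = np.random.default_rng(seed)
    R = np.array([[1.0, rho], [rho, 1.0]])   # target correlation matrix
    L = np.linalg.cholesky(R)                # lower triangular, R = L L'
    z = rng.standard_normal((n, 2))          # independent z_1 and z_2
    # Each theta is a weighted linear combination of z_1 and z_2;
    # the group mean is added as a constant shift (an assumption of this sketch).
    return z @ L.T + np.asarray(mean, dtype=float)


# Hypothetical focal group: correlation .8 and a mean of .5 on Trait 1.
thetas = simulate_correlated_thetas(1000, 0.8, mean=(0.5, 0.0))
sample_r = np.corrcoef(thetas.T)[0, 1]
print(thetas.shape, round(sample_r, 2))
```

With 1000 simulees the sample correlation will differ slightly from the target; the difference shrinks as the number of simulees grows.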

Item Parameters

Discrimination  values.  Multidimensional  discrimination  parameters  (i.e., 
MDISC,  see  Equation  8)  were  randomly  chosen  from  a  lognormal  distribution 
with mean 1.13 and standard deviation .60. Most of the values of MDISC were
between .5 and 2.5. Ackerman (1988) estimated the parameters of the M2PL
model for the ACT Assessment Mathematics Usage Test using the
multidimensional IRT computer program MIRTE (Carlson, 1987). Ackerman
reported that MDISC ranged from .58 through 2.39, and $a_1$ and $a_2$ ranged
between 0 and 2.00.
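
Sampling from a lognormal distribution with a stated mean and standard deviation requires converting those moments to the parameters of the underlying normal distribution. The sketch below assumes that 1.13 and .60 are the mean and standard deviation of MDISC itself, not of its logarithm; the text does not say which, so this reading is an assumption, as are the function name and seed.

```python
import numpy as np


def lognormal_params_from_moments(mean, sd):
    """Convert the mean and SD of a lognormal variable to (mu, sigma) of its logarithm."""
    sigma2 = np.log(1.0 + (sd / mean) ** 2)
    mu = np.log(mean) - sigma2 / 2.0
    return mu, np.sqrt(sigma2)


mu, sigma = lognormal_params_from_moments(1.13, 0.60)

rng = np.random.default_rng(0)
mdisc_values = rng.lognormal(mean=mu, sigma=sigma, size=40)  # one value per item

# Moments implied by (mu, sigma); these recover the inputs.
implied_mean = np.exp(mu + sigma ** 2 / 2.0)
implied_sd = implied_mean * np.sqrt(np.exp(sigma ** 2) - 1.0)
print(implied_mean, implied_sd)
```

Under this reading, the bulk of the distribution falls between .5 and 2.5, which agrees with the description above.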

Difficulty values. Multidimensional item difficulty parameters (i.e., MID, see
Equation 5) were chosen from a normal distribution with mean of .00 and
standard deviation of 1.00 by using the function RANNOR in SAS. The model
(Equation 4) can be written in a form more similar to the usual two-parameter
logistic model; in this form the exponent of Equation 4 is expressed as

$$\sum_{k=1}^{n} a_{ik} (\theta_{jk} - b_{ik})$$

where n is the number of dimensions, $a_{ik}$ is an element of $\mathbf{a}_i$, $\theta_{jk}$ is an element of
$\boldsymbol{\theta}_j$, and $d_i = -\sum_{k=1}^{n} a_{ik} b_{ik}$. From Equations 4 and 5 and the preceding expression,
$d_i$ is related to $\mathbf{b}_i$ through the following equation:

$$d_i = -\sum_{k=1}^{n} a_{ik} b_{ik} = -(\mathrm{MID}_i)(\mathrm{MDISC}_i) \tag{8}$$

As Began and Yen (1983) described, $b_i$ normally ranges from -3.14 through 2.5
in published tests (Lord, 1968; Ross, 1966; Yen, 1982). Results of the previously
mentioned ACT Mathematics test showed that MID ranged from -.73 through
1.87. For this study, the MID values were randomly chosen from a normal
distribution with the range from -2 through 2.
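
Given MDISC, MID, and the direction angles of an item, the model parameters of Equation 4 follow from Equations 5 and 6: each discrimination is MDISC times the corresponding direction cosine, and d is the negative product of MID and MDISC. The sketch below illustrates the conversion; the function name is hypothetical, and the input values are those listed for item 2 in Table 5.

```python
import numpy as np


def item_parameters_from_vector(mdisc, mid, angles_deg):
    """Convert (MDISC, MID, direction angles in degrees) to the vector a and the scalar d."""
    cosines = np.cos(np.radians(np.asarray(angles_deg, dtype=float)))
    a = mdisc * cosines   # from Equation 6: a_ik = MDISC_i * cos(alpha_ik)
    d = -mid * mdisc      # from Equation 5: d_i = -(MID_i)(MDISC_i)
    return a, d


# Item 2 of Table 5: a unidimensional item pointing along Trait 1.
a, d = item_parameters_from_vector(1.20, 0.76, [0.0, 90.0])
print(np.round(a, 2), round(d, 2))
```

The angles must be mutually consistent, that is, their squared cosines must sum to 1; the function does not check this.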

Item direction. The direction of an item in terms of direction cosines is
expressed in Equation 6. In a two-dimensional space, if an item measures only
Trait 1, then $\alpha_{i1}$ is 0°. If an item measures only Trait 2, in a two-dimensional
space, $\alpha_{i1}$ is 90°. The value of $\alpha_{i1}$ can be any value from 0° through 90°,
depending on the degree to which the item measures the two traits. If $\alpha_{i1}$ = 45°,
for example, the item measures both Trait 1 and Trait 2 equally. In MD 1, all but
5% (10% or 20%) of the items had $\alpha_{i1}$ = 0°. This is a situation in which all the
items measure Trait 1 and only a small percentage of items measure both Trait
1 and Trait 2. When 10% of the items (i.e., four items) were multidimensional,
$\alpha_{i1}$ was 15° for the first of the four items, 30° for the second, 45° for the third, and
60° for the fourth. Without loss of generality, these were the last four items on
the test. When 20% of the items were multidimensional, the item parameters for
items 33, 34, 35, and 36 were equal to those for items 37, 38, 39, and 40,
respectively. The item parameters for these latter four items were equal to those
for multidimensional items in the test in which 10% of the items were
multidimensional. When 5% of the items were multidimensional, the item
parameters for items 39 and 40 were equal to the item parameters for items 39
and 40 in the test in which 10% of the items were multidimensional. Items 37
and 38 had item parameters equal to those for items 1 and 2. Item parameter
specifications for the tests of type MD 1 with 10% of the items being
multidimensional are shown in Table 5. In MD 2, two $\alpha$'s had to be specified to
determine an item vector in the three-dimensional space. For items which
measured only the first two traits, $\alpha_{i1}$ was systematically assigned one of the
four values (15°, 30°, 45°, and 60°) to simulate situations in which an item
measures mostly Trait 1, measures Trait 1 and 2 equally, or measures mostly
Trait 2. When 10% of the items (four items) measured three traits, one item had
$\alpha_{i3}$ = 75°, one 60°, one 45°, and one 30°. For each of these four items, $\alpha_{i1}$ and


Table 5

Item Parameters in MD 1 with 10% Two-Dimensional Items (Last 4 Items)

Item#   α_i1   α_i2   MDISC     MID    a_i1    a_i2       d
    1      0     90    0.89    0.02    0.89    0.00   -0.01
    2      0     90    1.20    0.76    1.20    0.00   -0.91
    3      0     90    0.88    1.84    0.88    0.00   -0.61
    4      0     90    2.07   -0.15    2.07    0.00    0.32
    5      0     90    1.00   -0.75    1.00    0.00    0.74
    6      0     90    2.10    0.86    2.10    0.00   -1.80
    7      0     90    1.76   -0.52    1.76    0.00    0.92
    8      0     90    1.40    0.43    1.40    0.00   -0.61
    9      0     90    0.62    0.34    0.62    0.00   -0.21
   10      0     90    1.07   -0.77    1.07    0.00    0.83
   11      0     90    1.28    1.25    1.28    0.00   -1.59
   12      0     90    0.81   -0.06    0.81    0.00    0.05
   13      0     90    3.12    1.10    3.12    0.00   -3.42
   14      0     90    0.61   -0.32    0.61    0.00    0.20
   15      0     90    0.60    0.35    0.60    0.00   -0.21
   16      0     90    1.25    0.32    1.25    0.00   -0.40
   17      0     90    1.08   -1.15    1.08    0.00    1.24
   18      0     90    2.44   -0.11    2.44    0.00    0.26
   19      0     90    1.76   -0.24    1.76    0.00    0.41
   20      0     90    0.94    0.33    0.94    0.00   -0.31
   21      0     90    0.73   -0.68    0.73    0.00    0.50
   22      0     90    0.95   -0.85    0.95    0.00    0.81
   23      0     90    1.52    0.50    1.52    0.00   -0.76
   24      0     90    1.39    1.67    1.39    0.00   -2.33
   25      0     90    1.77   -1.59    1.77    0.00    2.82
   26      0     90    0.58   -0.05    0.58    0.00    0.03
   27      0     90    0.53   -1.06    0.53    0.00    0.56
   28      0     90    1.22    0.16    1.22    0.00   -0.20
   29      0     90    0.63    0.69    0.63    0.00   -0.44
   30      0     90    0.58    0.58    0.58    0.00   -0.34
   31      0     90    0.71    0.28    0.71    0.00   -0.20
   32      0     90    0.53    0.34    0.53    0.00   -0.18
   33      0     90    1.04    1.18    1.04    0.00   -1.65
   34      0     90    1.53    0.09    1.53    0.00   -0.14
   35      0     90    0.60   -0.84    0.60    0.00    0.51
   36      0     90    1.26   -0.13    1.26    0.00    0.16
   37     15     75    0.73    0.12    0.70    0.19   -0.09
   38     30     60    1.07    1.38    0.93    0.53   -1.48
   39     45     45    1.75    0.14    1.24    1.24   -0.24
   40     60     30    0.94    1.28    0.47    0.82   -1.20


$\alpha_{i2}$ were set equal to one another and selected so that $\alpha_{i1}$, $\alpha_{i2}$, and $\alpha_{i3}$ met the
constraint

$$\sum_{k=1}^{n} \cos^2 \alpha_{ik} = 1 \qquad (9)$$

Table 6 shows the item parameters for the tests of type MD 2. After setting
values on MDISC, MID, and $\alpha$, values for $a_{ik}$ and $d_i$ were determined by
Equations 6 and 8.
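
The conversion from MDISC, MID, and the item angles to $a_{ik}$ and $d_i$ can be sketched as follows. This is only an illustration: Equations 6 and 8 are not reproduced in this section, so the sketch assumes the commonly used relations $a_{ik} = \mathrm{MDISC}_i \cos\alpha_{ik}$ and $d_i = -\mathrm{MDISC}_i \cdot \mathrm{MID}_i$, and the function names are hypothetical.

```python
import math


def item_parameters(mdisc, mid, angles_deg):
    """Return ([a_i1, ..., a_in], d_i) for one item.

    Assumed relations (not taken verbatim from Equations 6 and 8):
        a_ik = MDISC_i * cos(alpha_ik)
        d_i  = -MDISC_i * MID_i
    """
    a = [mdisc * math.cos(math.radians(alpha)) for alpha in angles_deg]
    d = -mdisc * mid
    return a, d


def shared_angle_given_third(alpha3_deg):
    """Angle shared by alpha_i1 and alpha_i2 so that the squared direction
    cosines sum to 1 (the constraint in Equation 9) once alpha_i3 is fixed."""
    cos_sq = (1.0 - math.cos(math.radians(alpha3_deg)) ** 2) / 2.0
    return math.degrees(math.acos(math.sqrt(cos_sq)))


# Item 38 of Table 5: MDISC = 1.07, MID = 1.38, angles 30 and 60 degrees.
a, d = item_parameters(1.07, 1.38, [30, 60])

# Shared angle for a three-dimensional item with alpha_i3 = 75 degrees.
shared = shared_angle_given_third(75)
```

Under these assumptions the item 38 values come out at approximately 0.93, 0.53, and -1.48, and the shared angle for $\alpha_{i3}$ = 75° rounds to 47°, which agrees with the tabled entries.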

Item data. For each condition, response vectors were generated for each
subgroup for tests of 40 items. Probabilities of answering an item correctly,
$P(x_{ij} = 1 \mid \mathbf{a}_i, d_i, \boldsymbol{\theta}_j)$, were calculated for each item i and each simulee j by using
the M2PL model by Reckase (1986). For each (i, j) pair a random number from
0 to 1 was generated from a uniform distribution using the function RANUNI in
SAS. If the random number was less than or equal to $P(x_{ij} = 1 \mid \mathbf{a}_i, d_i, \boldsymbol{\theta}_j)$,
simulee j passed item i; otherwise, simulee j failed item i.
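
The response-generation step can be sketched in Python (the study itself used SAS). The sketch assumes the compensatory logistic form of the M2PL, $P = 1 / (1 + e^{-(\mathbf{a}_i \cdot \boldsymbol{\theta}_j + d_i)})$, with no additional scaling constant; the seed, the uncorrelated standard normal traits, and the function names are illustrative assumptions, not details from the study.

```python
import math
import random


def m2pl_probability(a, d, theta):
    """P(x = 1 | a, d, theta) under an assumed compensatory M2PL form."""
    z = sum(a_k * t_k for a_k, t_k in zip(a, theta)) + d
    return 1.0 / (1.0 + math.exp(-z))


def simulate_responses(items, thetas, rng):
    """items: list of (a, d) pairs; thetas: list of trait vectors.
    Returns one 0/1 response vector per simulee."""
    data = []
    for theta in thetas:
        row = []
        for a, d in items:
            p = m2pl_probability(a, d, theta)
            # The simulee passes the item when the uniform draw is <= p.
            row.append(1 if rng.random() <= p else 0)
        data.append(row)
    return data


rng = random.Random(12345)  # hypothetical seed
# Items 38 and 39 of Table 5, given as ([a_i1, a_i2], d_i).
items = [([0.93, 0.53], -1.48), ([1.24, 1.24], -0.24)]
# Assumption for illustration: two uncorrelated standard normal traits.
thetas = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(1000)]
responses = simulate_responses(items, thetas, rng)
```

Each row of `responses` is one simulee's response vector; a full replication would use all 40 items and the trait distribution specified for the condition.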

Analysis

Parameter  Estimation 

The responses of 1000 simulees from each subgroup to the 40 items in each
condition were analyzed by the unidimensional IRT calibration program
PC-BILOG (Mislevy & Bock, 1985). The item parameters of the two-parameter
logistic model were estimated using the default priors: a log-normal prior on the
discrimination estimates and no prior on the difficulty estimates.

Item Bias Detection and Its Accuracy

The  item  parameter  estimates  from  PC-BILOG  were  analyzed  by  using  SAS 
for  scaling  and  to  compute  an  item  parameter  invariance  index.  The  index  used 
in  the  study  was  the  unsigned  sum  of  squared  differences  (USOS)  between  the 
ICCs  of  a  given  item  based  on  the  item  parameter  estimates  of  two  subgroups 


Table 6

Item Parameters in MD 2 with 10% Three-Dimensional Items (Last 4 Items)

Item#   α_i1   α_i2   α_i3   MDISC     MID    a_i1    a_i2    a_i3       d
    1     15     75     90    0.43    1.22    0.42    0.11    0.00   -0.52
    2     30     60     90    1.22    1.34    1.04    0.60    0.00   -1.61
    3     45     45     90    0.64   -0.95    0.45    0.45    0.00    0.61
    4     60     30     90    0.85    0.97    0.43    0.74    0.00   -0.82
    5     15     75     90    0.48   -0.31    0.46    0.12    0.00    0.15
    6     30     60     90    0.98    0.16    0.85    0.49    0.00   -0.16
    7     45     45     90    0.72   -0.51    0.51    0.51    0.00    0.37
    8     60     30     90    1.65   -1.41    0.83    1.43    0.00    2.33
    9     15     75     90    0.74   -0.69    0.71    0.19    0.00    0.51
   10     30     60     90    2.05    1.29    1.77    1.02    0.00   -2.64
   11     45     45     90    1.43   -0.81    1.01    1.01    0.00    1.16
   12     60     30     90    0.61   -0.05    0.31    0.53    0.00    0.03
   13     15     75     90    1.43    1.04    1.38    0.37    0.00   -1.49
   14     30     60     90    1.47   -0.57    1.28    0.74    0.00    0.84
   15     45     45     90    1.01    1.08    0.72    0.72    0.00   -1.09
   16     60     30     90    1.23    0.54    0.61    1.06    0.00   -0.66
   17     15     75     90    0.53   -0.71    0.52    0.14    0.00    0.38
   18     30     60     90    0.86    0.34    0.74    0.43    0.00   -0.29
   19     45     45     90    1.99   -1.21    1.41    1.41    0.00    2.41
   20     60     30     90    1.15    0.18    0.58    1.00    0.00   -0.21
   21     15     75     90    0.73    0.30    0.70    0.19    0.00   -0.22
   22     30     60     90    1.77   -1.58    1.53    0.88    0.00    2.80
   23     45     45     90    0.66   -1.19    0.47    0.47    0.00    0.79
   24     60     30     90    0.89    1.38    0.44    0.77    0.00   -1.23
   25     15     75     90    0.86    0.37    0.83    0.22    0.00   -0.32
   26     30     60     90    0.41   -0.08    0.36    0.21    0.00    0.03
   27     45     45     90    0.74    1.84    0.52    0.52    0.00   -1.36
   28     60     30     90    1.10    0.86    0.55    0.95    0.00   -0.95
   29     15     75     90    1.47    0.55    1.42    0.38    0.00   -0.81
   30     30     60     90    1.31    0.07    1.13    0.65    0.00   -0.09
   31     45     45     90    0.87    1.02    0.61    0.61    0.00   -0.89
   32     60     30     90    0.45   -1.58    0.22    0.39    0.00    0.71
   33     15     75     90    1.18    1.23    1.14    0.30    0.00   -1.45
   34     30     60     90    0.73   -1.10    0.63    0.36    0.00    0.80
   35     45     45     90    0.93   -0.39    0.66    0.66    0.00    0.36
   36     60     30     90    1.47   -0.50    0.73    1.27    0.00    0.74
   37     47     47     75    2.56    0.46    1.75    1.75    0.66   -1.18
   38     52     52     60    1.38   -0.59    0.84    0.84    0.69    0.81
   39     60     60     45    0.58    0.72    0.29    0.29    0.41   -0.42
   40     69     69     30    1.83   -0.54    0.65    0.65    1.58    0.99


(see Miller and Linn, 1988). The USOS index was computed by summing the
squared difference between the ICCs at the midpoint of each of 600 intervals of
.01 between $\theta$ = -3.00 and 3.00:

$$\mathrm{USOS} = \sum_{\theta = -2.995}^{+2.995} \left( P_1(\theta) - P_2(\theta) \right)^2 \qquad (10)$$
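
As an illustration of Equation 10, the sum over the 600 interval midpoints can be sketched as below. The two-parameter logistic ICC is written with an explicit `scale` argument because the scaling constant used in the calibration is not stated in this section; the default of 1.0 and the function names are assumptions.

```python
import math


def icc_2pl(theta, a, b, scale=1.0):
    """Two-parameter logistic ICC; `scale` is an assumed scaling constant."""
    return 1.0 / (1.0 + math.exp(-scale * a * (theta - b)))


def usos(params_1, params_2, lo=-3.0, hi=3.0, n_intervals=600):
    """Sum of squared ICC differences at the midpoints of equal intervals
    between lo and hi. params_1 and params_2 are (a, b) pairs for the same
    item as estimated in the two subgroups (after scaling)."""
    width = (hi - lo) / n_intervals
    total = 0.0
    for k in range(n_intervals):
        theta = lo + (k + 0.5) * width  # first midpoint is -2.995
        diff = icc_2pl(theta, *params_1) - icc_2pl(theta, *params_2)
        total += diff ** 2
    return total


# Hypothetical estimates for one item in the reference and focal groups.
value = usos((1.0, 0.0), (1.2, 0.3))
```

The sum is left unweighted, following the form of Equation 10; identical parameter estimates in both groups give a USOS of exactly zero.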

Baselines 

A  baseline  method  was  employed  to  determine  the  occurrence  of  biased 
items.  That  is,  the  baseline  groups  were  generated  by  replicating  a  reference 
group  while  item  parameters  and  trait  distributions  were  held  constant.  For 
example,  for  a  condition  in  which  a  reference  group  and  a  focal  group 
(generated  to  represent  the  condition)  are  compared,  for  each  of  the  40  items  a 
USOS  statistic  was  calculated  for  the  baseline  condition.  In  the  baseline 
condition, the values of $P_1(\theta)$ and $P_2(\theta)$ were estimated from ICCs generated for
two  random  samples  of  1000  examinees  with  identical  trait  and  parameter 
distributions.  The  mean  and  the  standard  deviation  of  the  USOS  statistics  over 
the  40  items  were  calculated  to  establish  the  criterion  for  determining  a  biased 
item. The "rule of thumb" is to reject the item parameter invariance hypothesis if
the USOS statistic for an item exceeds the baseline mean plus two standard
deviations. This process was repeated five times with five replicated baselines
under each condition so that the fluctuation of the baseline criterion values
would be taken into account. Under each experimental condition, for each item,
a USOS statistic was calculated for the reference and focal groups. The USOS
statistic was compared to each of the five baseline criterion values to see if
unbiased items were incorrectly identified as biased (false positives) in
conditions under no bias, and if biased items were successfully identified as
biased (true positives) in conditions with bias on certain items.
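
The baseline rule can be sketched as follows. The sketch assumes the sample standard deviation (n - 1 denominator) of the 40 baseline USOS values; the text does not say which form was used, and the function names and the short example lists are hypothetical.

```python
import statistics


def baseline_criterion(baseline_usos):
    """Criterion value: baseline mean USOS plus two standard deviations.
    The sample standard deviation is an assumption."""
    return statistics.mean(baseline_usos) + 2 * statistics.stdev(baseline_usos)


def flag_items(study_usos, criterion):
    """Return the 1-based numbers of items whose USOS exceeds the criterion."""
    return [i + 1 for i, value in enumerate(study_usos) if value > criterion]


# Hypothetical USOS values for a five-item illustration.
baseline = [0.2, 0.4, 0.3, 0.5, 0.6]
study = [0.3, 1.5, 0.4, 0.9, 0.2]
criterion = baseline_criterion(baseline)
flagged = flag_items(study, criterion)
```

In the study this comparison would be repeated against each of the five replicated baseline criterion values, so an item can be flagged under one replication and not under another.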




As noted in Tables 1 and 2, in MD 1 all the reference groups had the same
trait distributions, $\bar{\theta}_{jB}$ = 0. There were three sets of item parameters depending
on the percentage of the multidimensional items (5%, 10%, and 20%).
Therefore, three baselines were needed for the three different sets of item
parameters. As noted in Tables 3 and 4, in MD 2 all the reference groups had
the same set of item parameters, but there were three different types of trait
distributions ($r_{\theta_1,\theta_2}$ = 0, .5, and .8, all with $\bar{\theta}_{jB}$ = 0). Therefore, again, three
baselines were necessary for the three different types of trait distributions. Three
conditions in MD 1 and three conditions in MD 2 served as baselines due to
their identical item parameters and trait distributions between the two groups
(see Tables 1 through 4). Then, the USOS statistics resulting from the other 42
conditions were compared with those from each appropriate baseline
depending on the percentage of the multidimensional items of the reference
group (for MD 1), or depending on the degree of the correlation $r_{\theta_1,\theta_2}$ of the
reference group (for MD 2).

Summary 

Data for two types of multidimensionality structures were simulated. The first
type, MD 1, had one relevant trait and one irrelevant trait. The second type, MD
2, had two relevant traits and one irrelevant trait. In either case, the irrelevant
trait influenced only a small number of items. In each type of dimensionality, both
no-bias and bias situations were considered. For MD 1, the percentage of
multidimensional items was altered (i.e., 5%, 10%, or 20%). The between-group
mean difference on the first trait was also a factor of interest. For MD 2,
the degree of correlation was altered (i.e., $r_{\theta_1,\theta_2}$ = 0, .5, and .8). The between-group
mean difference and/or the between-group correlation difference on the
first two traits constituted factors of interest. With the conditions mentioned
above, 42 data sets were generated. The 42 data sets, each set having two
subgroups, were simulated with 1000 simulees and 40 items in each subgroup.
These simulated data sets were then calibrated by PC-BILOG, and each item
was examined for item bias using an item parameter invariance index, USOS.
Finally, the occurrence of false positives under the no-bias situation and the
bias situation was examined. In the bias situation, the true-positive detection
rate was also examined.


CHAPTER  4 
RESULTS  AND  DISCUSSION 


In  this  study,  data  for  two  different  types  of  multidimensional  structures  were 
simulated  and  analyzed  with  both  no-bias  and  bias  situations.  The  results  of 
the  analyses  have  been  organized  into  two  sections  according  to  the  types  of 
multidimensionality.  In  each  section,  results  for  the  no-bias  and  the  bias 
situations  are  reported.  In  addition,  the  results  of  analyses  examining  the 
stability  of  multiple  baseline  conditions  are  provided  in  the  third  section. 

Multidimensional  Structure  1 

In the first type of multidimensional structure (MD 1), the test was dominantly
unidimensional with only a small percentage of items being influenced by an
additional irrelevant trait. Factors of interest were (a) the between-group mean
difference on the relevant trait ($\bar{\theta}_{1A} - \bar{\theta}_{1B}$ = 0 or .5) and (b) the percentage of
items influenced by the irrelevant trait (5%, 10%, or 20%). The effects of the
above two factors were studied in the no-bias situation ($\bar{\theta}_{2A} - \bar{\theta}_{2B}$ = 0, indicating
no between-group difference on the irrelevant trait) and in the bias situation
($\bar{\theta}_{2A} - \bar{\theta}_{2B}$ = .5, indicating a between-group mean difference on the irrelevant
trait). For each experimental condition, test data were simulated for two groups
of 1000 examinees each on a 40-item test. The resulting data were analyzed
by using PC-BILOG to obtain item parameter and ability estimates. Item bias
detection was then performed using the USOS item parameter invariance index
and the baseline method described in an earlier chapter. For each condition, five
replicates of the baseline data also were simulated.
No-Bias Situation

Table 7 contains the results for Conditions 1 through 3 (see Table 1 for the
description of each condition). In these conditions, there was a mean
difference between reference and focal groups on the dominant trait (i.e.,
$\bar{\theta}_{1A} - \bar{\theta}_{1B}$ = .5). For each condition, the number of false positives is provided. A
false  positive  is  an  unbiased  item  with  a  USOS  that  exceeded  the  criterion 
established  for  bias.  The  USOS  criterion  values  of  the  three  baseline 
conditions  over  the  five  replications  in  MD  1  are  reported  in  Table  8.  For  each 
multidimensional  item,  the  number  of  false  positive  occurrences  over  the  five 
baseline  replications  is  displayed  in  Table  9. 

From  the  results  in  Table  7  one  can  see  that  while  the  number  of  false 
positives  in  the  condition  where  5%  of  the  items  were  multidimensional  was 
rather  high,  no  false  positives  or  only  a  few  false  positives  occurred  in  the 
conditions  where  10%  or  20%  of  the  items  were  multidimensional.  In  fact, 
contrary  to  what  was  expected,  the  number  of  false  positives  decreased  as  the 
percentage  of  multidimensional  items  increased.  In  addition,  the  number  of 
false  positives  varied  within  a  condition  depending  on  the  replication  of  the 
baseline  condition.  The  variation  of  the  number  of  false  positives  can  be 
attributed  to  the  fluctuation  of  the  criterion  values  over  the  five  baseline 
replications  (see  Table  8).  The  results  in  Table  9  indicate  that  false  positives 
did not necessarily occur on the multidimensional items. Taken together, the results
from Tables 7 and 9 imply that when a small number of items are
multidimensional, using an IRT-based item bias detection technique with the
baseline method will not always result in excessive false positive rates simply
because the reference group and focal group have different means on the
relevant trait.
Table 7

Frequency of Occurrence of False Positives in the No-Bias Situation in MD 1

                         Baseline Replications
Condition     %a       1     2     3     4     5       M      SD
    1         5%       2     7     5    13     7     6.8    4.02
    2        10%       4     0     0     0     7     2.2    3.19
    3        20%       3     0     0     0     3     1.2    1.64

Note. Conditions 1 - 3 refer to the conditions with a between-group mean
difference on Trait 1.

a Percentage of items that are multidimensional.
Table 8

USOS Criterion Values in MD 1

                         Baseline Replications
Condition     %a        1       2       3       4       5       M      SD
   B1         5%     2.30    1.43    1.68    0.99    1.51    1.58    0.47
   B2        10%     1.62    1.87    1.89    2.05    1.10    1.72    0.33
   B3        20%     2.11    2.31    2.78    2.65    1.83    2.34    0.39

Note. The criterion values are the mean USOS plus two standard deviations in
a baseline condition.

a Percentage of items that are multidimensional.
Table 9

Frequency of Occurrence of False Positives for Each Multidimensional Item in
Five Baseline Replications

                      Item # for the Multidimensional Items
Condition     %a      33    34    35    36    37    38    39    40
    1         5%                                           1     4
    2        10%                               2     0     0     0
    3        20%       0     0     0     2     0     0     0     0

Note. Conditions 1 - 3 refer to the conditions with a between-group mean
difference on Trait 1.

a Percentage of items that are multidimensional.



Bias  Situation 

Tables 10 and 11 contain results for Conditions 4 through 6 where the
reference and focal groups had equal means on the relevant trait, but unequal
means on the irrelevant trait (see Table 2). These tables also contain results for
Conditions 7 through 9 where the reference and focal groups had different
means on both traits. The occurrence of false positives is reported in Table 10.
The occurrence of true positives is reported in Table 11. From the results
shown for Conditions 4 through 6 in Table 10, one can see that the occurrence
of false positives was comparable to that in the no-bias situation in Table 7. The
occurrence of false positives was rather high when 5% of the items were
multidimensional, depending on the replication of the baseline condition;
however, no false positives or a small number of false positives were identified
when 10% or 20% of the items were multidimensional. The percent of biased
items detected correctly (the true-positive detection rate) was 100% when 5% of
the items were multidimensional. The detection rates for the 10% and 20%
multidimensional conditions were lower (50% and 58%, respectively). These
latter rates suggest the possibility that the USOS index may not be sufficiently
sensitive unless the number of biased items is small. From the results for
Conditions 7 through 9 in Tables 10 and 11, the between-group mean
difference on the relevant trait appears to have no evident effect on the
occurrence of false positives or on the detection of biased items.
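
The tallies behind Tables 10 and 11 can be illustrated with a short sketch. It assumes that false and true positives are counted separately against each replicated criterion value and that the detection rate is the mean true-positive count divided by the number of biased items; the function names and the example USOS values are hypothetical.

```python
def classify_flags(study_usos, criteria, biased_items):
    """For each baseline criterion value, return (false_positives,
    true_positives). biased_items is a set of 1-based item numbers."""
    results = []
    for criterion in criteria:
        flagged = {i + 1 for i, v in enumerate(study_usos) if v > criterion}
        true_pos = len(flagged & biased_items)
        false_pos = len(flagged - biased_items)
        results.append((false_pos, true_pos))
    return results


def detection_rate(true_positive_counts, n_biased):
    """Mean true-positive count as a percentage of the biased items."""
    mean_tp = sum(true_positive_counts) / len(true_positive_counts)
    return 100.0 * mean_tp / n_biased


# Hypothetical four-item test in which items 3 and 4 are biased.
tallies = classify_flags([0.5, 2.0, 3.0, 1.0], [1.5, 2.5], {3, 4})

# True-positive counts reported for Condition 5 in Table 11 (4 biased items).
rate_condition_5 = detection_rate([2, 2, 2, 1, 3], 4)
```

With the Condition 5 counts the rate is 50%, matching the tabled value; the same calculation for Condition 9 (three of eight items in every replication) gives 37.5%, reported as 38%.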

Multidimensional  Structure  2 

In  the  second  type  of  multidimensional  structure  (MD  2),  there  were  three 
traits:  two  relevant  traits  (Trait  1  and  Trait  2)  which  affected  all  the  items  and 
one  irrelevant  trait  (Trait  3)  which  affected  only  10%  of  the  items.  Factors  of 
interest  were  (a)  the  between-group  mean  difference  on  the  relevant  traits  and 
(b)  the  between-group  difference  in  the  correlation  between  the  relevant  traits. 


Table 10

Frequency of Occurrence of False Positives in the Bias Situation in MD 1

                            Baseline Replications
Condition     %a           1     2     3     4     5       M      SD
    4         5%(38)b      0     5     4    12     5     5.2    4.32
    5        10%(36)       2     2     2     1     4     2.4    1.14
    6        20%(32)       0     0     0     0     0     0.0    0.00
    7         5%(38)       1     7     5     8     6     5.4    2.70
    8        10%(36)       2     1     1     0     3     1.4    1.14
    9        20%(32)       1     1     0     1     1     0.8    0.45

Note. Conditions 4 - 6 refer to the conditions with no between-group mean
difference on Trait 1. Conditions 7 - 9 refer to the conditions with a
between-group mean difference on Trait 1.
a Percentage of items that are multidimensional.
b ( ) = The number of unbiased items on a 40-item test.


Table 11

Frequency of Occurrence of True Positives in the Bias Situation in MD 1

                            Baseline Replications
Condition     %a           1     2     3     4     5       M      SD     %b
    4         5%(2)c       2     2     2     2     2     2.0    0.00   100%
    5        10%(4)        2     2     2     1     3     2.0    0.71    50%
    6        20%(8)        5     5     4     4     5     4.4    0.55    58%
    7         5%(2)        2     2     2     2     2     2.0    0.00   100%
    8        10%(4)        2     2     2     2     2     2.0    0.00    50%
    9        20%(8)        3     3     3     3     3     3.0    0.00    38%

Note. Conditions 4 - 6 refer to the conditions with no between-group mean
difference on Trait 1. Conditions 7 - 9 refer to the conditions with a
between-group mean difference on Trait 1.
a Percentage of items that are multidimensional.
b Percent of biased items detected correctly.
c ( ) = The number of biased items on a 40-item test.
Here again, the results are reported separately for the no-bias situation and
the bias situation. The procedures for simulating and analyzing the data were
the same as those described above for MD 1, except that here the test had
the MD 2 structure.

No-Bias  Situation 

Table  12  contains  the  occurrence  of  false  positives  in  all  15  conditions  in  the 
no-bias  situation  (see  Table  3).  The  USOS  criterion  values  of  the  three 
baseline  conditions  over  the  five  replications  in  MD  2  are  reported  in  Table  13. 
Figures 3 through 7 are pictorial presentations of Conditions 1 through 3, 4
through 6, 7 through 9, 10 through 12, and 13 through 15 in Table 12,
respectively.

The first factor of interest was the between-group mean difference on the
dominant trait(s). With two dominant traits, there can be between-group
differences on either or both traits. For Conditions 1 through 3 the highest mean
number of false positives occurred when $r_{\theta_1,\theta_2}$ = .5, the next
highest when $r_{\theta_1,\theta_2}$ = 0, and the lowest when $r_{\theta_1,\theta_2}$ = .8. The number of false
positives varied within a condition depending on the replication of the baseline
condition. The variation was again due to the fluctuation of the criterion values
(see Table 13). Across the three conditions, the average number of items
identified as biased was 4.5. Thus the occurrence of false positives was not
substantial when the means differed on the two dominant traits. Similar results
were obtained for Conditions 4 through 6 (see Table 12 and Figure 4) in which
there was a between-group mean difference on only Trait 1. Across the three
conditions, the average number of items identified as biased was 2.9. The
results for Conditions 1 through 6 in Table 12 are also similar to those for
Conditions 1 through 3 in Table 7 in which the effect of the between-group
difference on the dominant trait was examined in MD 1 where the test was

Table 12

Frequency of Occurrence of False Positives in the No-Bias Situation in MD 2

                               Baseline Replications
Condition   r(θ1,θ2)         1     2     3     4     5       M      SD
    1       .0 vs. .0        2     2     2     6     6     3.6    2.19
    2       .5 vs. .5       10     9     7     8    11     9.0    1.58
    3       .8 vs. .8        1     1     1     1     1     1.0    0.00
    4       .0 vs. .0        0     1     1     4     7     2.6    2.87
    5       .5 vs. .5        5     5     5     5     5     5.0    0.00
    6       .8 vs. .8        2     1     1     1     1     1.2    0.45
    7       .5 vs. .8        6     6     5     6     6     5.8    0.50
    8       .0 vs. .5        8    12    13    16    19    13.6    4.16
    9       .0 vs. .8       14    22    22    28    29    23.0    6.00
   10       .5 vs. .8       15    14    11    11    15    13.2    2.05
   11       .0 vs. .5       13    26    26    35    35    27.0    9.02
   12       .0 vs. .8       20    27    27    35    36    29.0    6.60
   13       .5 vs. .8        5     4     3     4     5     4.2    0.84
   14       .0 vs. .5       12    20    20    30    31    22.6    7.92
   15       .0 vs. .8       13    22    22    31    32    24.0    7.78

Note. Conditions 1 - 3 and 10 - 12 refer to the conditions with between-group
mean differences on Trait 1 and Trait 2; Conditions 4 - 6 and 13 - 15 refer to
conditions with a between-group mean difference on Trait 1; Conditions 7 - 9
refer to conditions with no between-group differences.
Table 13

USOS Criterion Values in MD 2

                               Baseline Replications
Condition   r(θ1,θ2)        1       2       3       4       5       M      SD
   B1          .0        3.09    2.20    2.14    1.46    1.36    2.05    0.70
   B2          .5        1.44    1.48    1.69    1.56    1.41    1.52    0.11
   B3          .8        1.68    1.91    2.34    1.92    1.77    1.93    0.25

Note. The criterion values are the mean USOS plus two standard deviations in
a baseline condition.


Figure 3. Occurrence of false positives in Conditions 1 - 3 (between-group
mean differences on Trait 1 and Trait 2)


Figure 4. Occurrence of false positives in Conditions 4 - 6 (a between-group
difference on Trait 1)


Figure 5. Occurrence of false positives in Conditions 7 - 9 (a between-group
correlation difference)


Figure 6. Occurrence of false positives in Conditions 10 - 12 (a between-group
correlation difference as well as mean differences on Trait 1 and Trait 2)


Figure 7. Occurrence of false positives in Conditions 13 - 15 (a between-group
correlation difference as well as a mean difference on Trait 1)

dominantly unidimensional. These results suggest that item bias detection
using  USOS  and  the  baseline  method  is  likely  to  maintain  the  useful 
characteristic  of  no  confounding  between  measurement  bias  and  between- 
group  difference  even  when  the  unidimensionality  assumption  is  violated. 

The second factor of interest was the between-group correlation difference.
In Conditions 1 through 6, the correlation $r_{\theta_1,\theta_2}$ was the same in both groups.
In Conditions 7 through 9, the two groups had exactly the same means on all
three traits, but the correlations between the first two traits varied between
groups. The magnitude of the difference in the correlation had three levels:
.5 vs. .8, 0 vs. .5, and 0 vs. .8 for the reference and focal groups, respectively.
The results provided in Table 12 and Figure 5 for Conditions 7 through 9
indicate a strikingly high occurrence of false positives. Across the three
conditions, the average number of items incorrectly identified as biased was
14.1. In the most extreme case, 29 out of 40 items were identified as biased.
The results for Conditions 7 through 9 in Table 12 also form an interesting
pattern. The number of false positives increased steadily as the difference in
the correlation between the groups increased. When the absolute value of the
correlation difference was .8, the number of false positives ranged from 14
through 29 with a mean of 23.0 items over the five replicated baselines. Also
noteworthy are the standard deviations, which are rather high in Conditions 8
and 9 when compared with Conditions 1 and 4 in which the same set of
baselines, Baseline 1, was used. This indicates that many of the USOS values in
Conditions 8 and 9 were close to the criterion values and that the
fluctuation of the baselines was substantially influential in identifying bias. On
the other hand, in Conditions 1 and 4, most of the USOS values were small and
distant from the criterion values, and therefore the fluctuation of the baselines
did not have as much effect as in Conditions 8 and 9.


In the remaining six conditions in the no-bias situation (Conditions 10
through 15), both the between-group mean difference and the between-group
correlation difference were considered simultaneously. In Conditions 10
through 12, there were between-group mean differences on both Trait 1 and
Trait 2, coupled with the between-group correlation difference with three levels,
$r_{\theta_1,\theta_2}$ = .5 vs. .8, 0 vs. .5, and 0 vs. .8. In Conditions 13 through 15, there was a
between-group mean difference on only Trait 1, coupled with the same three
levels of the between-group correlation difference as in Conditions 10 through
12. Here, the results were consonant with those from Conditions 7 through 9 in
which there was a between-group correlation difference alone. In both
Conditions 10 through 12 and 13 through 15, a strikingly large number of items
were incorrectly identified as biased (see Table 12 and Figures 6 and 7). The
average number of false positives was 23.0 items in Conditions 10 through 12,
and 16.9 items in Conditions 13 through 15. In the most extreme case, 36 out of
40 items had USOS values that exceeded the criterion value. In other words,
90% of the items on the test were incorrectly identified as biased. Here again,
rather high values of the standard deviations were observed in Conditions 11,
12, 14, and 15. These findings for Conditions 10 through 15 suggest that the
group mean difference can exacerbate the occurrence of false positives when
the correlations between the relevant traits are different between two groups.

Figure 8 illustrates the occurrence of false positives in all the conditions
in the no-bias situation. In Figure 8, Conditions 1 through 15 have been
grouped, and the means taken within each group, to delineate the effect of
(a) between-group mean differences on Trait 1 and Trait 2 (Conditions 1
through 3), (b) a between-group mean difference on only Trait 1 (Conditions 4
through 6), (c) a between-group correlation difference (Conditions 7 through
9), (d) the combination of (a) and (c) (Conditions 10 through 12), and (e) the
combination of (b) and (c) (Conditions 13 through 15). Also illustrated in
Figure 8 are the numbers of false positives in the baselines, which show the
expected number of false positives. The 40 USOS values in each replication of
the baselines were compared with the criterion value of its own baseline. Then
the three types of baselines (B1 through B3) were combined by taking the mean
number of items that were more than two standard deviations above the mean
USOS. Figure 8 indicates the obvious occurrence of false positives in (c), (d),
and (e). In sum, these results in the no-bias situation suggest that a
substantially large number of false positives can occur due to the
between-group correlation difference, and the occurrence of false positives
can become more evident as the difference in correlations becomes more
extreme. The results also suggest that, coupled with the between-group
correlation difference, the between-group mean difference can augment the
occurrence of false positives.

Figure 8. Summary of occurrence of false positives in Conditions 1 - 15 in the
no-bias situation in MD 2
Legend: (a) mean difference (two traits); (b) mean difference (one trait);
(c) correlation difference; (d) correlation & mean difference (two traits);
(e) correlation & mean difference (one trait); Baselines
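The baseline criterion described above can be sketched in a few lines of
Python. This is only an illustration: the USOS values are hypothetical, the
function names are not part of the study's software, and the use of the sample
standard deviation is an assumption.

```python
from statistics import mean, stdev

def criterion_value(baseline_usos, k=2.0):
    """Mean USOS of a baseline plus k standard deviations (k = 2 in the text).

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    return mean(baseline_usos) + k * stdev(baseline_usos)

def flag_items(usos, baseline_usos, k=2.0):
    """Return the indices of items whose USOS exceeds the baseline criterion."""
    cutoff = criterion_value(baseline_usos, k)
    return [i for i, value in enumerate(usos) if value > cutoff]

# Hypothetical USOS values for a five-item illustration (not from the study).
baseline = [0.2, 0.4, 0.6, 0.8, 1.0]
experimental = [0.3, 1.5, 0.9, 2.0, 0.1]

flagged = flag_items(experimental, baseline)
print(f"criterion = {criterion_value(baseline):.3f}, flagged items = {flagged}")
```

In the no-bias situation every flagged item is a false positive, so the length
of the returned list corresponds to the counts reported in the tables.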

Bias Situation

In the bias situation in MD 2, the two groups had different means on the third
trait (θ3A − θ3B = .5). This created four biased items on the test. The
experimental conditions were the same as for the no-bias situation except for
the mean difference on the third trait (see Table 4). The results for all 18
conditions
in  the  bias  situation,  Conditions  16  through  33,  are  summarized  in  Tables  14 
and  15.  The  occurrence  of  false  positives  is  reported  in  Table  14.  The 
occurrence of true positives is reported in Table 15.

Conditions 16 through 18 are the case in which the test was dominantly
two-dimensional, with only the last four items being three-dimensional, and
there was no between-group mean or correlation difference on the two dominant
traits. The results indicate that there was no substantial occurrence of false
positives on the 36 unbiased items, and that the true-positive detection rate
(average of 38%) was about the same as, or slightly lower than, that of the
case when the data were dominantly unidimensional (MD 1) with 10% of the items
being biased (50%). There was no tendency for the true-positive detection rate
to increase as the correlation between the first two traits increased (45%,
35%, and 35% detection rate for rθ1,θ2 = 0, .5, and .8, respectively). This
finding suggests that the rate of correctly identifying items as biased does
not decline when the test is multidimensional and when USOS and the baseline
method are used.


Table 14

Frequency of Occurrence of False Positives on 36 Unbiased Items in the Bias
Situation in MD 2

                             Baseline Replications
Condition   rθ1,θ2           1     2     3     4     5       M      SD

   16       .0 vs. .0        1     3     3     5     5     3.4    1.67
   17       .5 vs. .5        [illegible]
   18       .8 vs. .8        2     1     0     1     2     1.2    0.83
   19       .0 vs. .0        1     2     2     8     8     4.2    3.49
   20       .5 vs. .5        [illegible]
   21       .8 vs. .8        4     4     2     3     3     3.2    0.89
   22       .0 vs. .0        0     1     2     6     7     3.2    3.11
   23       .5 vs. .5        4     4     4     4     4     4.0    0.00
   24       .8 vs. .8        3     3     2     3     3     2.8    0.45
   25       .5 vs. .8        9     9     8     8     9     8.6    0.55
   26       .0 vs. .5        6    13    13    20    20    14.4    5.86
   27       .0 vs. .8       19    27    29    32    32    27.8    5.36
   28       .5 vs. .8        8     8     8     8     8     8.0    0.00
   29       .0 vs. .5       14    17    17    19    21    17.6    2.61
   30       .0 vs. .8       19    26    26    32    32    27.0    5.39
   31       .5 vs. .8       12    12     8    12    13    11.4    1.95
   32       .0 vs. .5       15    21    21    30    30    23.4    6.50
   33       .0 vs. .8       24    28    28    31    32    28.6    3.13

Note. Conditions 16 - 18 and 25 - 27 refer to the conditions with no
between-group mean differences; Conditions 19 - 21 and 28 - 30 refer to
between-group mean differences on Trait 1 and Trait 2; Conditions 22 - 24 and
31 - 33 refer to a between-group mean difference on Trait 1.


Table 15

Frequency of Correct Detection of Four Biased Items in the Bias Situation in
MD 2

                             Baseline Replications
Condition   rθ1,θ2           1     2     3     4     5       M      SD     %a

   16       .0 vs. .0        1     2     2     2     2     1.8    0.45    45%
   17       .5 vs. .5        2     1     1     1     2     1.4    0.55    35%
   18       .8 vs. .8        2     1     1     1     2     1.4    0.55    35%
   19       .0 vs. .0        1     2     2     4     4     2.6    1.34    65%
   20       .5 vs. .5        2     2     2     2     2     2.0    0.00    50%
   21       .8 vs. .8        1     1     1     1     1     1.0    0.00    25%
   22       .0 vs. .0        1     1     1     2     2     1.4    0.55    35%
   23       .5 vs. .5        1     1     1     1     1     1.0    0.00    25%
   24       .8 vs. .8        2     2     2     2     2     2.0    0.00    50%
   25       .5 vs. .8        2     2     2     2     2     2.0    0.00    50%
   26       .0 vs. .5        2     2     2     2     2     2.0    0.00    50%
   27       .0 vs. .8        2     2     2     4     4     2.8    1.10    70%
   28       .5 vs. .8        2     1     1     1     2     1.4    0.55    35%
   29       .0 vs. .5        2     2     2     3     3     2.4    0.55    60%
   30       .0 vs. .8        2     2     2     3     3     2.4    0.55    60%
   31       .5 vs. .8        3     3     2     2     3     2.6    0.55    65%
   32       .0 vs. .5        2     3     3     4     4     3.2    0.84    80%
   33       .0 vs. .8        3     3     3     3     3     3.0    0.00    75%

Note. Conditions 16 - 18 and 25 - 27 refer to the conditions with no
between-group mean differences; Conditions 19 - 21 and 28 - 30 refer to
between-group mean differences on Trait 1 and Trait 2; Conditions 22 - 24 and
31 - 33 refer to a between-group mean difference on Trait 1.
a Percent of biased items detected correctly.
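As a worked check of the detection rates quoted for Conditions 16 through 18,
the short sketch below recomputes them from the replication counts in Table 15.
The variable and function names are illustrative only.

```python
from statistics import mean

N_BIASED = 4  # four biased items per test in the bias situation

# Correct detections per baseline replication, copied from Table 15.
detections = {
    16: [1, 2, 2, 2, 2],
    17: [2, 1, 1, 1, 2],
    18: [2, 1, 1, 1, 2],
}

def detection_rate(counts, n_biased=N_BIASED):
    """Percent of biased items detected, averaged over the replications."""
    return 100 * mean(counts) / n_biased

rates = {cond: detection_rate(counts) for cond, counts in detections.items()}
average_rate = mean(list(rates.values()))

for cond, rate in rates.items():
    print(f"Condition {cond}: {rate:.0f}%")
print(f"Average over Conditions 16-18: {average_rate:.0f}%")
```

The per-condition rates are the column M of Table 15 divided by the four biased
items, and their average is the 38% cited in the text.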

The  results  for  Conditions  19  through  24  in  Tables  14  and  15  were  similar  to 
those  from  Conditions  16  through  18.  Although  the  number  of  false  positives 
was  slightly  elevated,  a  substantial  number  of  false  positives  was  not  observed. 
The  true-positive  detection  rates  (average  of  47%  for  Conditions  19  through  21, 
37%  for  Conditions  22  through  24)  remained  close  to  those  of  Conditions  16 
through  18.  These  results  suggest  that  there  was  no  evident  effect  of  between- 
group  mean  differences  on  the  true-positive  detection  rate  or  on  the  occurrence 
of  false  positives. 

Conditions 25 through 33 are cases in which there was a between-group
correlation difference. In Conditions 28 through 33, there were both
between-group mean and correlation differences. Here again, as in Conditions 7
through 15 in the no-bias situation, a strikingly large number of unbiased
items were identified as biased. In the extreme case, 32 out of 36 unbiased
items exceeded the criterion value. The number of false positives increased as
the difference in the correlations became more extreme. For example, the
average numbers of false positives across the five baselines were 8.6, 14.4,
and 27.8 items for rθ1,θ2 = .5 vs. .8, 0 vs. .5, and 0 vs. .8, respectively, in
Conditions 25 through 27. These results provide additional support for the
previously mentioned findings regarding the effect of the between-group
correlation difference on the occurrence of false positives. Because of this
large number of false positives, the increase in the true-positive detection
rate (average of 57% across Conditions 25 through 27, 52% across Conditions 28
through 30, and 73% across Conditions 31 through 33) is not interpretable.

The  results  for  Conditions  28  through  33  also  provide  additional  support  for 
the  previously  mentioned  findings  regarding  the  combined  effect  of  between- 
group  mean  and  correlation  differences.  The  occurrence  of  false  positives  was 
exacerbated  by  the  existence  of  both  between-group  mean  and  correlation 
differences. The average number of false positives was 17.5 across Conditions
28 through 30 and 21.1 across Conditions 31 through 33. By comparison, the
average number of false positives across Conditions 25 through 27 was 16.9.
The preceding results given in the bias situation suggest
that  the  true-positive  detection  rate  does  not  appear  to  be  influenced  by  the 
degree  of  the  correlation  between  the  two  dominant  traits  or  by  the  between- 
group  mean  difference.  When  there  is  a  between-group  correlation  difference, 
the  occurrence  of  false  positives  may  be  a  more  critical  issue  than  the  true- 
positive  detection  rate. 
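The group averages compared in this paragraph follow directly from the
condition means in Table 14. The sketch below reproduces the arithmetic; the
dictionary and function names are illustrative.

```python
from statistics import mean

# Mean number of false positives per condition (column M of Table 14).
mean_false_positives = {
    25: 8.6, 26: 14.4, 27: 27.8,   # correlation difference only
    28: 8.0, 29: 17.6, 30: 27.0,   # plus mean differences on Traits 1 and 2
    31: 11.4, 32: 23.4, 33: 28.6,  # plus a mean difference on Trait 1
}

def group_mean(first, last):
    """Average the condition means for conditions first..last (inclusive)."""
    return mean(mean_false_positives[c] for c in range(first, last + 1))

for first, last in [(25, 27), (28, 30), (31, 33)]:
    print(f"Conditions {first}-{last}: {group_mean(first, last):.1f}")
```

The three groups average 16.9, 17.5, and 21.1 items, respectively, which are
the values on which the comparison in the text rests.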

Baselines 

The results presented in the preceding two sections provide information
about the various effects of multidimensionality on the USOS index that is
based on the baseline method. Additionally, the results illuminate some
potential problems associated with the baseline method. As shown in Tables 7
through 15 and Figures 3 through 8, rather large fluctuations over the five
replications of the baselines were observed. In other words, the number of
false positives or true positives can vary to a great degree depending on
which replication of the baseline condition is used. Table 16 contains a
summary of the criterion values in the three types of baselines in MD 1 and
the three types of baselines in MD 2, with five replications in each type. In
addition, criterion values from unidimensional baselines with five
replications are reported in the same table for comparison. The unidimensional
baselines were simulated with item parameters from the dominant trait (Trait
1) in MD 1 to examine the degree of fluctuation in the unidimensional
baseline. Among the six multidimensional baseline conditions, the condition
with rθ1,θ2 = 0 for MD 2 exhibited the highest standard deviation (SD = .70)
and, within MD 2, the highest mean criterion value (M = 2.05). However,
neither the mean nor the standard deviation decreased systematically as the
correlation between Trait 1 and Trait 2 became stronger (M = 1.52 and 1.93,
and SD = .11 and .25, for rθ1,θ2 = .5 and .8, respectively). Therefore, the
evidence that multidimensionality causes a higher degree of baseline
fluctuation is not conclusive here. For the baselines in MD 1, the mean showed
a slight increase as the percentage of multidimensional items increased
(M = 1.58, 1.72, and 2.34 for 5%, 10%, and 20% multidimensionality,
respectively), and there was no systematic pattern in the standard deviations
(SD = .47, .33, and .39 for 5%, 10%, and 20% multidimensionality,
respectively). For the unidimensional baseline condition, the mean was 1.79
and the standard deviation was .54. These results suggest that baselines
fluctuate regardless of the dimensionality of the test. As mentioned earlier,
the fluctuation of the baselines becomes more crucial when the USOS values in
the experimental condition are somewhat inflated and range around two standard
deviations above the mean USOS of the baseline. That was the case when the two
groups had a different correlation between Trait 1 and Trait 2 in this study.
The number of items identified as biased varied considerably depending on the
choice of the baseline from the five replicated baselines. The results in this
study suggest careful use of the baseline method for the assessment of item
parameter invariance between two groups.


Table 16

Summary of the Criterion Values in the Baselines

                           Baseline Replications
Condition                1      2      3      4      5       M      SD

MD 1        %a
  B1        5%        2.30   1.43   1.68   0.99   1.51    1.58    0.47
  B2        10%       1.62   1.87   1.89   2.05   1.10    1.72    0.33
  B3        20%       2.11   2.31   2.78   2.65   1.83    2.34    0.39

MD 2        rθ1,θ2
  B1        .0        3.09   2.20   2.14   1.46   1.36    2.05    0.70
  B2        .5        1.44   1.48   1.69   1.56   1.41    1.52    0.11
  B3        .8        1.68   1.91   2.34   1.92   1.77    1.93    0.25

UDb                   2.54   1.54   1.56   2.46   1.55    1.79    0.54

Note. The criterion values are the mean USOS plus two standard deviations in
a baseline.
a Percentage of items that are multidimensional.
b Unidimensional baseline condition.
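The fluctuation summarized in Table 16 can be recomputed from the replication
values. The sketch below does so for two of the MD 2 baselines, using the
sample standard deviation (an assumption that reproduces the tabled values for
these two rows); the names are illustrative.

```python
from statistics import mean, stdev

# Criterion values over five baseline replications, copied from Table 16.
criteria = {
    "MD 2, r = .0": [3.09, 2.20, 2.14, 1.46, 1.36],
    "MD 2, r = .5": [1.44, 1.48, 1.69, 1.56, 1.41],
}

# Mean and sample standard deviation of the criterion across replications.
summary = {
    label: (round(mean(values), 2), round(stdev(values), 2))
    for label, values in criteria.items()
}

for label, (m, sd) in summary.items():
    lo, hi = min(criteria[label]), max(criteria[label])
    print(f"{label}: M = {m:.2f}, SD = {sd:.2f}, range = {lo:.2f} to {hi:.2f}")
```

For the uncorrelated baseline the criterion ranges from 1.36 to 3.09, so an item
with a USOS value between those two figures would or would not be flagged
depending solely on which replication served as the baseline.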


CHAPTER  5 
CONCLUSIONS 


The purpose of this study was to investigate false-positive and true-positive
detection  rates  when  the  USOS  index  and  the  baseline  method  are  applied 
with  a  variety  of  types  of  multidimensional  tests.  Factors  that  were 
experimentally  manipulated  were  (a)  the  percentage  of  multidimensional  items 
when  the  data  were  dominantly  unidimensional,  (b)  the  between-group  mean 
difference  on  dominant  trait(s),  and  (c)  the  between-group  correlation 
difference  when  there  were  two  dominant  traits.  General  conclusions  and 
implications  are  presented  in  this  chapter. 

Effects  of  Between-Group  Correlation  Difference 

The  most  substantial  occurrence  of  false  positives  was  observed  when  the 
test  data  had  two  dominant  traits,  and  the  two  traits  had  different  correlations  for 
the  two  groups  under  investigation.  When  the  correlation  between  the  two  traits 
was  0  in  the  reference  group  and  .8  in  the  focal  group,  29  out  of  40  unbiased 
items  were  incorrectly  identified  as  biased  in  the  most  extreme  case.  When 
coupled  with  the  between-group  mean  difference,  the  occurrence  of  false 
positives  was  even  further  exacerbated.  In  the  most  extreme  case,  36  out  of  40 
unbiased  items  were  identified  as  biased.  This  phenomenon  was  replicated 
when  four  biased  items  were  created  in  the  40-item  test.  A  large  number  of  the 
36  unbiased  items  were  identified  as  biased  (as  many  as  32  items),  and  the 
occurrence of false positives was more evident when both the between-group
correlation difference and mean differences existed than when only the
between-group  correlation  difference  existed. 

In  this  study,  the  contrast  of  the  difference  in  the  correlations  had  only  three 
levels:  .5  vs.  .8,  0  vs.  .5,  and  0  vs.  .8.  Within  this  limited  number  of  contrasts,  a 
steady  increase  in  the  number  of  false  positives  was  observed  as  the 
correlation  difference  became  more  extreme.  While  it  is  difficult  to  determine 
when  a  correlation  difference  is  large  enough,  or  more  importantly,  realistic 
enough  in  a  practical  situation,  to  cause  a  substantial  number  of  false  positives, 
the  results  in  this  study  suggest  that  a  correlation  difference  is  a  potential  threat 
to the validity of the item bias index. Presumably, when a test is
multidimensional, trait correlations, as well as means and distributions of the
traits, may vary for the two groups under investigation. For example, Miller and
Krieshok (in press) found that higher-order trait correlations on the 16PF were
different  for  males  and  females,  while  the  factor  loadings  and  error  variances 
were  the  same. 
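To make the notion of a between-group correlation difference concrete, the
following sketch draws two traits for a reference group (correlation 0) and a
focal group (correlation .8), with 1000 simulees per group as in this study. It
is not the data-generation program used in the study; the construction of the
second trait from two independent normal draws, the seed, and all names are
assumptions made for illustration.

```python
import math
import random

def simulate_traits(n, r, rng):
    """Draw n pairs of standard-normal traits with population correlation r."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        theta1 = z1
        # Mixing z1 into the second trait induces correlation r with theta1.
        theta2 = r * z1 + math.sqrt(1 - r * r) * z2
        pairs.append((theta1, theta2))
    return pairs

def pearson(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the illustration is repeatable
reference = simulate_traits(1000, 0.0, rng)  # r = 0 in the reference group
focal = simulate_traits(1000, 0.8, rng)      # r = .8 in the focal group

print(f"reference r = {pearson(reference):.2f}, focal r = {pearson(focal):.2f}")
```

Both groups have the same trait means and variances here; only the correlation
differs, which is the situation that produced the largest number of false
positives.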

The  between-group  correlation  difference  exhibited  another  undesirable 
effect  associated  with  the  baseline  method.  A  large  fluctuation  was  observed  in 
terms  of  the  number  of  false  positives  depending  on  the  choice  of  the  replication 
of  the  baseline.  Thus,  this  study  suggests  that  the  number  of  false  positives 
would  not  be  reliable  in  a  situation  where  a  correlation  difference  is  suspected 
between  the  two  groups.  These  findings,  in  all,  suggest  caution  in  the  use  of  the 
USOS  index  which  employs  the  baseline  method  when  a  between-group 
correlation  difference  is  suspected. 

Effects of Between-Group Mean Difference

Although  the  number  of  false  positives  was  slightly  elevated  in  a  few 
conditions,  generally  the  occurrence  of  false  positives  was  not  substantial  when 
there were between-group mean differences on the relevant traits. IRT-based
item bias indices are known for their capability of not confounding measurement
bias  with  between-group  differences  (Drasgow,  1987).  This  useful 
characteristic  was  shown  even  when  there  were  two  dominant  traits.  In 
addition,  the  between-group  mean  difference  did  not  appear  to  lower  the  true- 
positive  detection  rate.  On  the  other  hand,  when  there  was  a  between-group 
correlation  difference,  the  between-group  mean  difference  exacerbated  the 
occurrence  of  false  positives  as  mentioned  in  the  preceding  section.  These 
results  suggest  that,  when  there  is  no  between-group  correlation  difference,  the 
USOS  index  maintains  its  characteristic  of  no  confounding  effect  between 
measurement  bias  and  between-group  differences  even  when  the 
unidimensional  assumption  is  violated. 

Effects of Different Percentages of Multidimensional Items

When the test was dominantly unidimensional, the percentage of
multidimensional items affected detection of truly biased items. The detection
rate decreased as the number of biased items increased. When there were more
than four biased items, the detection rate dropped to around 50%. These
findings are comparable to those of other researchers who have suggested
iterative procedures for detecting bias (i.e., estimating bias, removing the
biased items, reestimating item parameters, and examining for item bias
again). Marco (1977) and Lord (1980) have suggested that the biased items
affect the item parameters estimated in the first step. An empirical study by
Miller and Oshima (1989), in which data were simulated to examine a two-stage
procedure for estimating item bias, also suggested that as the percentage of
biased items increased, the ability to detect biased items using IRT-based
item bias detection indices decreased. The results in this study, therefore,
support the rationale for using an iterative procedure for detecting bias when
a fairly large number of items are suspected to be biased.
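The iterative procedure mentioned above (estimate bias, remove the flagged
items, reestimate, and examine again) can be sketched as a simple loop. The
index used here is a deliberately simplified, hypothetical stand-in and is not
the USOS index: each item's between-group difficulty difference is centered on
the mean difference of the currently unflagged items. The data and cutoff are
likewise invented for illustration.

```python
def toy_index(diffs, anchor):
    """Hypothetical bias index: distance of each item's between-group
    difficulty difference from the mean difference of the anchor items."""
    center = sum(diffs[i] for i in anchor) / len(anchor)
    return {i: abs(d - center) for i, d in enumerate(diffs)}

def iterative_detection(diffs, cutoff, max_rounds=10):
    """Flag items, drop them from the anchor set, and repeat until stable."""
    items = range(len(diffs))
    flagged = set()
    for _ in range(max_rounds):
        anchor = [i for i in items if i not in flagged]
        if not anchor:  # nothing left to anchor on; stop
            break
        index = toy_index(diffs, anchor)
        new_flagged = {i for i in items if index[i] > cutoff}
        if new_flagged == flagged:
            break
        flagged = new_flagged
    return sorted(flagged)

# Six unbiased items (difference 0) and four biased items (illustrative data).
diffs = [0.0] * 6 + [1.0, 1.0, 1.0, 0.6]

single_pass = iterative_detection(diffs, cutoff=0.5, max_rounds=1)
iterated = iterative_detection(diffs, cutoff=0.5)
print("single pass:", single_pass)
print("iterated:   ", iterated)
```

In this toy example the biased items pull the first-pass center upward, so the
weakest biased item is missed until the flagged items are removed from the
anchor set. That is the mechanism Marco (1977) and Lord (1980) point to.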


Results  of  this  study  also  indicate  that  multidimensionality  alone,  if  the 
means,  correlations,  and  distributions  of  the  traits  for  two  groups  are  the  same, 
should  not  cause  false  positives  when  the  baseline  method  is  employed  to 
compare  the  item  parameters.  The  reason  is  apparent;  namely,  the  baseline  is 
also  multidimensional  when  the  test  is  multidimensional.  This  favorable 
characteristic  was  maintained  even  when  there  were  between-group  mean 
differences. 

Baseline  Stability 

Results of this study demonstrate the instability of the baseline method. As
shown in Tables 7 through 15 and Figures 3 through 8, the number of items
identified as biased can vary from baseline to baseline. The fluctuation was
not limited to the multidimensional baselines. The unidimensional baseline
simulated in this study showed as much fluctuation as the multidimensional
baselines. Thus, these findings provide grounds for questioning the customary
use of the baseline method (i.e., dividing the reference group into two groups
only once, or selecting a baseline group only once from the large sample of
the reference group). Using multiple baselines or making the cutoff more
rigorous than two standard deviations are possible alternatives to the
customary use of the baseline method.
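One way the multiple-baseline alternative might be operationalized is sketched
below. The two decision rules (a majority vote over replications and a
comparison with the mean criterion) are assumptions introduced for
illustration; the study does not prescribe either. The criterion values are
those of the MD 2 baseline with uncorrelated traits in Table 16, and the USOS
value is hypothetical.

```python
from statistics import mean

# Criterion values of the five baseline replications (MD 2, r = .0, Table 16).
criteria = [3.09, 2.20, 2.14, 1.46, 1.36]

def votes(usos_value, criteria):
    """Number of baseline replications whose criterion the item exceeds."""
    return sum(usos_value > c for c in criteria)

def flagged_by_majority(usos_value, criteria):
    """Flag only if the item exceeds the criterion in most replications."""
    return votes(usos_value, criteria) > len(criteria) / 2

def flagged_by_mean_criterion(usos_value, criteria):
    """Flag only if the item exceeds the criterion averaged over replications."""
    return usos_value > mean(criteria)

hypothetical_usos = 2.0  # illustrative value, not taken from the study
print("replications flagging the item:", votes(hypothetical_usos, criteria))
print("majority rule:", flagged_by_majority(hypothetical_usos, criteria))
print("mean-criterion rule:", flagged_by_mean_criterion(hypothetical_usos, criteria))
```

With a single baseline, this item would be flagged under replications 4 and 5
and not under replications 1 through 3. Either pooled rule gives one answer
that does not depend on which replication happened to be drawn.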

Implications  for  Research 

The generalizability of the results is restricted to the conditions that
were considered in the present study. Different degrees of the between-group
mean difference, different magnitudes of bias, and the between-group
distribution difference (i.e., nonnormality) could constitute factors of
interest in another study. In this study, only three levels of correlations
were considered. Although a distinct trend was observed with the three levels,
further research is desirable to better delineate the extent to which the
between-group correlation difference could cause false positives.

The  present  study  was  of  theoretical  interest  in  examining  the  effect  of  the 
violation  of  the  unidimensional  assumption  on  an  IRT-based  item  bias  index. 
Research  with  actual  multidimensional  test  data  is  now  needed  to  investigate 
practical  differences  in  trait  distributions  in  two  groups  for  which  investigation  of 
item  bias  is  desired. 

Given  these  limitations,  the  results  of  the  present  study  should  provide  useful 
information for developers of large-scale assessment programs and
psychometric  researchers  who  engage  in  item  bias  studies  with 
multidimensional  test  data.  The  study  suggests  a  cautious  use  of  the  IRT-based 
item  bias  index  (USOS)  with  the  baseline  method  when  the  test  is  suspected  to 
be  multidimensional,  because  a  correlation  difference  between  two  groups  may 
seriously  affect  the  accuracy  of  item  bias  detection. 


REFERENCES 

Ackerman, T. A. (1987, April). The use of unidimensional item parameter
estimations of multidimensional items in adaptive testing. Paper
presented at the annual meeting of the American Educational Research
Association, Washington, DC.

Ackerman, T. A. (1988, April). An explanation of differential item functioning from
a multidimensional perspective. Paper presented at the annual meeting
of the American Educational Research Association, New Orleans.

Batley, R., & Boss, M. W. (1988, April). The effects on parameter estimation of
correlated abilities using a two-dimensional, two-parameter logistic item
response model. Paper presented at the annual meeting of the American
Educational Research Association, New Orleans.

Bejar, I. I. (1980). A procedure of investigating the unidimensionality of
achievement tests based on item parameter estimates. Journal of
Educational Measurement, 17, 283-296.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item
parameters: An application of an EM algorithm. Psychometrika, 46, 443-
459.

Bogan, E. D., & Yen, W. M. (1983, April). Detecting multidimensionality and
examining its effects on vertical equating with the three-parameter
logistic model. Paper presented at the annual meeting of the American
Educational Research Association, Montreal.

Carlson, J. E. (1987). Multidimensional item response theory estimation: A
computer program (ACT Research Report 87-19). Iowa City, IA: American
College Testing Program.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory.
New York: Holt, Rinehart, and Winston.

Divgi, D. (1986). Does a Rasch model really work for multiple choice items? Not if
you look closely. Journal of Educational Measurement, 23, 283-298.

Dorans, N. J., & Kingston, N. M. (1985). The effects of violations of
unidimensionality on the estimation of item and ability parameters and on
item response theory equating on the GRE verbal scale. Journal of
Educational Measurement, 22, 249-262.


Drasgow, F. (1983, April). The analysis of dichotomous test data using factor
analytic methodology. Paper presented at the annual meeting of the
American Educational Research Association, Montreal.

Drasgow, F. (1987). Study of measurement bias of two standardized
psychological tests. Journal of Applied Psychology, 72, 19-29.

Drasgow, F., & Lissak, R. (1983). Modified parallel analysis: A procedure for
examining the latent dimensionality of dichotomously scored item
responses. Journal of Applied Psychology, 68, 363-373.

Drasgow, F., & Parsons, C. K. (1983). Application of unidimensional item
response theory models to multidimensional data. Applied Psychological
Measurement, 7, 189-199.

Hambleton, R. K., & Murray, L. (1983). Some goodness of fit investigations for
item response models. In R. K. Hambleton (Ed.), Applications of item
response theory (pp. 71-94). Vancouver: Educational Research Institute
of British Columbia.

Harrison, D. A. (1986). Robustness of IRT parameter estimation to violation of
the unidimensionality assumption. Journal of Educational Statistics, 11,
91-115.

Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and
items. Applied Psychological Measurement, 9, 139-164.

Holland, P. (1981). When are item response models consistent with observed
data? Psychometrika, 46, 79-92.

Holland, P. (1985). The Mantel-Haenszel procedure applied to measuring
differential item performance. Seminar presentations at Educational
Testing Service, Princeton, NJ.

Holland, P., & Rosenbaum, P. (1985). Conditional association and
unidimensionality in monotone latent variable models. Unpublished
manuscript.

Hulin, C. L., Drasgow, F., & Komocar, J. (1982). Applications of item response
theory to analysis of attitude scale translations. Journal of Applied
Psychology, 67, 818-825.

Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory:
Application to psychological measurement. Homewood, IL: Dow Jones-
Irwin.


Hunter, J. E. (1975, December). A critical analysis of the use of item means and
item test correlations to determine the presence or absence of content
bias in achievement test items. Paper presented at the National Institute
of Education Conference on Test Bias, Annapolis, MD.

Ironson, G. H. (1982). Use of chi-square and latent trait approaches for
detecting item bias. In R. A. Berk (Ed.), Handbook of methods for
detecting item bias (pp. 117-160). Baltimore: Johns Hopkins University
Press.

Ironson, G. H. (1983). Using item response theory to measure bias. In R. K.
Hambleton (Ed.), Applications of item response theory (pp. 155-174).
Vancouver: Educational Research Institute of British Columbia.

Kingston, N. M., & McKinley, R. L. (1988, April). Assessing the structure of the
GRE general test using confirmatory multidimensional item response
theory. Paper presented at the annual meeting of the American
Educational Research Association, New Orleans.

Linn, R. L., & Harnisch, D. L. (1981). Interactions between item content and
group membership on achievement test items. Journal of Educational
Measurement, 18, 109-118.

Linn, R. L., Levine, M. V., Hastings, C. N., & Wardrop, J. L. (1981). Item bias in a
test of reading comprehension. Applied Psychological Measurement, 5,
159-173.

Lord, F. M. (1968). An analysis of the Verbal Scholastic Aptitude Test using
Birnbaum's three-parameter logistic model. Educational and
Psychological Measurement, 28, 989-1020.

Lord, F. M. (1980). Applications of item response theory to practical testing
problems. Hillsdale, NJ: Lawrence Erlbaum.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores.
Reading, MA: Addison-Wesley.

McKinley, R. L., & Kingston, N. M. (1988, April). Confirmatory analysis of test
structure using multidimensional IRT. Paper presented at the annual
meeting of the National Council on Measurement in Education, New
Orleans.

McKinley, R. L., & Reckase, M. D. (1983). MAXLOG: A computer program for the
estimation of the parameters of a multidimensional logistic model.
Behavior Research Methods and Instrumentation, 15, 389-390.

McLaughlin, M. E. (1986). Computing Lord's item bias statistic when abilities
and item parameters are estimated simultaneously. Unpublished
master's thesis, Department of Psychology, University of Illinois at
Urbana-Champaign.

Mantel,  N.,  &  Haenszel,  W.  (1959).  Statistical  aspects  of  the  analysis  of  data 
from  retrospective  studies  of  disease.  Journal  of  the  National  Cancer 
Institute.  22,  719-748. 

Marco,  G.L  (1977).  Item  characteristic  curve  solutions  to  three  intractable 
testing  problems.  Journal  of  Educational  Measurement.  15.  139-160. 

Miller,  M.  D.,  &  Krieshok,  T.  S.  (in  press).  Gender  differences  in  the  second- 
order  factor  structure  of  the  16PF.  Measurement  and  Evaluation  in 
Counseling  and  Development. 

Miller,  M.  D.,  &  Linn,  R.  L.  (1988).  Invariance  of  item  characteristic  functions  with 
variations  in  instructional  coverage.  Journal  of  Educational 
Measurement.  2^,  205-219. 

Miller,  M.  D.,  &  Oshima,  T.  (1989,  March).  An  investigation  of  a  two  stage 

procedure  for  detecting  item  bias.  Paper  presented  at  the  annual  meeting 
of  the  National  Council  on  Measurement  in  Education,  San  Francisco. 

Mislevy,  R.  J.,  &  Bock,  R.  D.  (1985).  PC-BILQG  Version  1.1 :  Maximum  likelihood 
item  analysis  and  test  scoring:  Logistic  model.  Mooresville,  IN:  Scientific 
Software,  Inc. 

Muthen,  8.  (1987).  Using  item-specific  instructional  information  in  achievement 
modeling.  Unpublished  manuscript. 

Muthen,  B.  O.,  Kao,  C,  &  Burstein,  L.  (1988,  April).  Instructional  sensitivity  in 
mathematics  achievement  test  items:  Application  of  a  new  IRT-based 
detection  technique.  Paper  presented  at  the  annual  meeting  of  the 
American  Educational  Research  Association,  New  Orleans. 

Reckase,  M.  D.  (1979).  Unifactor  latent  trait  models  applied  to  multifactor  tests: 
Results  and  implications.  Journal  of  Educational  Statistics.  ±,  207-230. 

Reckase,  M.  D.  (1985,  April).  The  difficultv  of  test  items  that  measure  more  than 
one  ability.  Paper  presented  at  the  annual  meeting  of  the  American 
Educational  Research  Association,  Chicago. 

Reckase,  M.  D.  (1986,  April).  The  discriminating  power  of  items  that  measure 
more  than  one  dimension.  Paper  presented  at  the  annual  meeting  of  the 
American  Educational  Research  Association,  San  Francisco. 

Reckase, M. D., Ackerman, T. A., & Carlson, J. E. (1988). Building a
unidimensional test using multidimensional items. Journal of Educational
Measurement.  25,  193-203. 


Reckase, M. D., & McKinley, R. L. (1983). Some latent trait theory in a
multidimensional latent space. In D. J. Weiss (Ed.), Proceedings of the
1982  Item  Response  Theory/Computer  Adaptive  Testing  Conference  (pp. 
151-177).  Minneapolis:  University  of  Minnesota,  Department  of 
Psychology. 

Rosenbaum,  P.  (1984).  Testing  the  conditional  independence  and  monotonicity 
assumptions of item response theory. Psychometrika. 49, 425-436.

Ross,  J.  (1966).  An  empirical  study  of  a  logistic  mental  test  model. 
Psychometrika. 31, 325-340.

Rudner,  L.  M.  (1977).  An  approach  to  biased  item  identification  using  latent  trait 
measurement theory. Paper presented at the annual meeting of the
American  Educational  Research  Association,  New  York. 

Samejima, F. (1974). Normal ogive model on the continuous response level in
the multidimensional latent space. Psychometrika. 39, 111-121.

SAS Institute. (1985). SAS User's Guides: Basic and Statistics: Version 5.
Cary, NC: SAS Institute, Inc.

Scheuneman,  J.  (1979).  A  new  method  for  assessing  bias  in  test  items.  Journal 
of Educational Measurement. 16, 143-152.

Shepard,  L.  A.,  Camilli,  G.,  &  Averill,  M.  (1981).  Comparison  of  procedures  for 
detecting  test-item  bias  with  both  internal  and  external  ability  criteria. 
Journal of Educational Statistics, 6, 317-375.

Shepard,  L.  A.,  Camilli,  G.,  &  Williams,  D.  M.  (1985).  Validity  of  approximation 
techniques  for  detecting  item  bias.  Journal  of  Educational  Measurement. 
22,  77-105. 

Stout,  W.  (1987).  A  nonparametric  approach  for  assessing  latent  trait 
dimensionality. Unpublished manuscript.

Sympson,  J.  B.  (1978).  A  model  for  testing  with  multidimensional  items.  In  D.  J. 
Weiss  (Ed.),  Proceedings  of  the  1977  Computer  Adaptive  Testing 
Conference  (pp.  82-98).  Minneapolis:  University  of  Minnesota, 
Department  of  Psychology. 

Thissen,  D.,  &  Steinberg,  L.  (1984).  A  response  model  for  multiple  choice  items. 
Psychometrika. 49, 501-519.

Traub, R. E. (1983). A priori considerations in choosing an item response
model. In R. K. Hambleton (Ed.), Applications of item response theory
(pp.  57-70).  Vancouver:  Educational  Research  Institute  of  British 
Columbia. 


Vojir, C. P., & Shepard, L. A. (1988). Effect of instructional history on item
response theory estimates derived from a mathematics achievement test.
Unpublished  manuscript. 

Whitely,  S.  E.  (1980).  Multicomponent  latent  trait  models  for  ability  tests. 
Psychometrika. 45, 479-494.

Wilson, D., Wood, R. L., & Gibbons, R. (1984). TESTFACT: Test scoring and item
factor analysis. [Computer program]. Mooresville, IN: Scientific Software,
Inc. 

Wright, B., Mead, R., & Draba, R. (1976). Detecting and correcting test item bias
with  a  logistic  response  model  (Research  Memorandum  No.  22). 
Chicago:  University  of  Chicago,  Department  of  Education. 

Wright,  B.,  &  Stone,  M.  (1979).  Best  test  design.  Chicago:  MESA  Press. 

Yen,  W.  M.  (1982,  March).  Use  of  three-parameter  item  response  theory  in  the 
development of CTBS, Form U, and TCS. Paper presented at the annual
meeting  of  the  National  Council  on  Measurement  in  Education,  New 
York. 

Yen,  W.  M.  (1984).  Effects  of  local  item  dependence  on  the  fit  and  equating 
performance  of  the  three-parameter  logistic  model.  Applied 
Psychological Measurement. 8, 125-145.


BIOGRAPHICAL  SKETCH 


Takako Oshima was born March 29, 1960, in Osaka, Japan. She received
the  Bachelor  of  Arts  degree  from  Osaka  Women's  University  in  March,  1982.  In 
August,  1984,  she  received  the  Master  of  Arts  degree  in  linguistics  from 
Southern  Illinois  University,  Carbondale.  She  began  her  doctoral  program  at 
the University of Florida in the fall of 1984. While in graduate school, she held a
variety  of  assistantships.  From  1984  to  1988,  she  taught  first-  and  second-year 
Japanese  in  the  Department  of  African  and  Asian  Languages  and  Literatures. 
From  1988  to  1989,  she  taught  Japanese  at  the  P.  K.  Yonge  Laboratory  School 
in  the  College  of  Education.  Concurrently,  she  worked  as  a  research  and 
teaching  assistant  in  the  Department  of  Foundations  of  Education.  One  of  her 
duties  was  to  teach  the  introductory  educational  measurement  and  evaluation 
course.  She  has  recently  accepted  a  position  as  Assistant  Professor  of 
Educational  Research  at  Georgia  State  University  in  Atlanta. 



I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope 
and  quality,  as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Linda  M.  Crocker,  Chair 
Professor  of  Foundations  of 
Education 


I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope 
and  quality,  as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


James J. Algina, Cochair
Professor of Foundations of
Education 


I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope 
and  quality,  as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


M.  David  Miller 
Assistant  Professor  of 
Foundations  of  Education 


I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope 
and  quality,  as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Michael Y. Nunnery
Professor of Educational
Leadership


This  dissertation  was  submitted  to  the  Graduate  Faculty  of  the  College  of 
Education  and  to  the  Graduate  School  and  was  accepted  as  partial  fulfillment  of 
the  requirements  for  the  degree  of  Doctor  of  Philosophy. 


May, 1989

Chairman, Foundations of
Education


Dean,  College  of  Education 


Dean,  Graduate  School 


